How to Use the Parquet Validator: Step-by-Step Tutorial
The Parquet Validator runs entirely in your browser: your data file is never sent to any server, no account is required, and nothing leaves your machine. This tutorial walks through every step of using the tool: loading a Parquet file, reading each results panel, understanding what the validator checks, and diagnosing the issues it reports.
Follow along with the tool open: Open the Parquet Validator in a second tab, then work through each step below.
Step 1: Open the Tool
Navigate to /developer-tools/parquet-validator/. The tool loads entirely in the browser: after the initial page load, validating a file makes zero outbound network requests. Only the Parquet footer (typically a few kilobytes at the end of the file) is read for most checks; row data is never decoded or transmitted. You can confirm this in your browser's DevTools Network panel: drop a file and watch the network tab remain idle.
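The footer-only read described above can be sketched outside the browser. This is a minimal illustration, not the validator's actual code: it assumes the standard Parquet layout (footer, then a 4-byte little-endian footer length, then the closing PAR1 marker), and the function name read_footer_bytes is hypothetical.

```python
import os
import struct

MAGIC = b"PAR1"


def read_footer_bytes(path):
    """Return the serialised footer, reading only the tail of the file."""
    size = os.path.getsize(path)
    # Smallest possible layout: 4-byte magic + 4-byte length + 4-byte magic.
    if size < 12:
        raise ValueError("file too small to be Parquet")
    with open(path, "rb") as f:
        # The last 8 bytes hold the footer length and the closing magic.
        f.seek(size - 8)
        tail = f.read(8)
        footer_len = struct.unpack("<I", tail[:4])[0]
        if tail[4:] != MAGIC:
            raise ValueError("closing magic bytes missing")
        if footer_len > size - 12:
            raise ValueError("footer length exceeds file size")
        # Jump straight to the footer; row data is never touched.
        f.seek(size - 8 - footer_len)
        return f.read(footer_len)
```

However large the file is, this reads only 8 bytes plus the footer itself, which is why validation time does not grow with the row data.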
The tool is accessible from the Developer Tools hub, the command palette (press Ctrl+K or ⌘K and type "Parquet Validator"), or directly via the URL above.
Step 2: Load Your Parquet File
Drag your .parquet, .pq, or .parq file from your file manager and drop it anywhere on the drop zone. The drop zone is labeled with an icon and text prompting you to drop a Parquet file. The tool also supports a full-page drag target: dragging a file anywhere over the browser window (not just the drop zone) triggers the large overlay drop indicator and accepts the file when released.
If you drop a file with an unrecognised extension, a type error message appears identifying the actual file extension and explaining why it was rejected. Only .parquet, .pq, and .parq files are accepted by extension check.
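The extension check can be pictured as follows. This is a sketch only: the function name, the wording of the message, and the case-insensitive comparison are assumptions, not details taken from the tool.

```python
from pathlib import PurePath

ACCEPTED_EXTENSIONS = {".parquet", ".pq", ".parq"}


def check_extension(filename):
    """Return (accepted, message) for a dropped file name."""
    # Assumption: the comparison ignores case, so DATA.PQ is accepted.
    ext = PurePath(filename).suffix.lower()
    if ext in ACCEPTED_EXTENSIONS:
        return True, ""
    shown = ext if ext else "(none)"
    return False, f"Unsupported file extension {shown}; expected .parquet, .pq, or .parq"
```

Only the final suffix is considered here, so a name such as data.parquet.gz would be rejected as .gz.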
Once a valid file is dropped, the file name appears in a pill below the drop zone and validation runs automatically. Results appear within a second for typical Parquet files regardless of their total size, because only the footer is read.
Removing a file: Once a file is loaded, an × button appears next to the filename. Click it to clear the file and reset all result panels so you can load a different file.
Step 3: Read the Status Bar
After validation runs, a status bar appears. It is colour-coded to give you an immediate answer:
- Green: "Valid Parquet - no issues found." The file passed all checks: both magic byte markers present, footer parseable, schema valid, row group structure intact, all encodings and compression codecs recognised.
- Yellow: "Valid Parquet - N warning(s). See details below." The file is structurally valid but one or more conditions were flagged, for example deprecated INT96 timestamp columns or unusual encoding declarations. See the Warnings panel for specifics.
- Red: "Validation failed - N error(s) found." The file failed one or more required checks. The error count is shown in the status bar; the full list appears in the Error panel below.
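The three states follow a simple precedence: any error makes the bar red, otherwise any warning makes it yellow, otherwise it is green. A minimal sketch of that rule, using a hypothetical status_bar function that takes the lists of error and warning messages:

```python
def status_bar(errors, warnings):
    """Map validation results to a (colour, message) pair."""
    if errors:
        return "red", f"Validation failed - {len(errors)} error(s) found."
    if warnings:
        return "yellow", f"Valid Parquet - {len(warnings)} warning(s). See details below."
    return "green", "Valid Parquet - no issues found."
```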
Step 4: Review the Error Panel
If validation fails, the Error panel appears with a red header listing every error detected. Common error messages and what they mean:
"Magic bytes missing at start of file"
The first four bytes are not PAR1 (hex: 50 41 52 31). The file is not a Parquet file: it may be ORC, Avro, CSV, Feather, or another format entirely, or it may be a Parquet file that was truncated before any data was written. Confirm the actual format with a hex editor or the file command: file yourfile.parquet. If it does not return "Apache Parquet", the file needs to be regenerated from the source.
"Magic bytes missing at end of file (file may be truncated)"
The opening magic bytes are correct but the closing PAR1 marker (the last four bytes) is absent or incorrect. This is the most common failure mode for Parquet files produced by jobs that crashed after writing row group data but before finalising the file. The footer was never written, so the file cannot be read. The job needs to be re-run from the start or from a checkpoint before the failed write.
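Both magic-byte checks can be reproduced locally in a few lines. The sketch below reuses the two error messages quoted above; the function name check_magic_bytes is hypothetical and the code is an illustration rather than the validator's implementation.

```python
import os

MAGIC = b"PAR1"  # hex 50 41 52 31
MISSING_START = "Magic bytes missing at start of file"
MISSING_END = "Magic bytes missing at end of file (file may be truncated)"


def check_magic_bytes(path):
    """Return a list of error messages; an empty list means both markers are present."""
    errors = []
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(4)
        # Seek to the last four bytes (or the start, for very short files).
        f.seek(max(size - 4, 0))
        tail = f.read(4)
    if head != MAGIC:
        errors.append(MISSING_START)
    if tail != MAGIC:
        errors.append(MISSING_END)
    return errors
```

A file cut off mid-write typically passes the first check and fails the second, which matches the crashed-job scenario described above.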
"Footer length is invalid or exceeds file size"
The four bytes immediately before the closing magic marker declare the footer length, but the value is negative, zero, or larger than the remaining file contents allow. This indicates the footer length field was written incorrectly, likely due to a partially-written file or a bug in the Parquet writer. The file is not recoverable without re-generating it from the source data.
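The bounds check on the footer length is simple arithmetic. In the sketch below, footer_offset is a hypothetical helper that takes the file size and the four length bytes; it reads them as a signed little-endian integer so that a negative value can be detected, which is an assumption made to match the description above.

```python
import struct


def footer_offset(file_size, length_field):
    """Return the byte offset where the footer starts, or None if the length is invalid.

    length_field is the 4 bytes immediately before the closing magic marker.
    """
    (declared,) = struct.unpack("<i", length_field)
    # The footer must fit between the opening magic (4 bytes) and the
    # trailing length field plus closing magic (8 bytes).
    max_len = file_size - 12
    if declared <= 0 or declared > max_len:
        return None
    return file_size - 8 - declared
```

For a 1,000-byte file that declares a 200-byte footer, the footer starts at offset 792; a declared length of 989 or more cannot fit and is rejected.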
"Failed to parse footer (Thrift deserialization error)"
The footer bytes were read successfully but the Thrift deserialisation failed. The footer is corrupt, possibly due to filesystem errors, storage media degradation, or an in-flight write that was interrupted partway through the footer bytes. Re-generate the file from the source.
"Schema is empty or missing"
The footer parsed successfully but the schema element list is absent or contains no columns. A Parquet file with no schema cannot be read by any query engine. This usually indicates a writer that initialised the footer structure but did not populate the schema before writing it.
Step 5: Review the Warnings Panel
If warnings were detected, a yellow Warnings panel lists each one. Warnings indicate valid-but-noteworthy conditions:
"Column 'X' uses deprecated INT96 type"
One or more timestamp columns use the INT96 physical type, which was an early Parquet convention for timestamps before the TIMESTAMP logical type annotation was standardised. INT96 is deprecated in the Parquet specification and is not supported by all modern query engines: notably, some versions of DuckDB, BigQuery external tables, and certain Spark configurations require INT64 with a TIMESTAMP logical type instead. Re-encode the file with a writer configured to emit INT64 timestamps, for example a current version of PyArrow, or Spark with spark.sql.parquet.outputTimestampType set to TIMESTAMP_MICROS.
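To run the same INT96 check in a script, for instance in CI before files are published, something like the following could work. It is a sketch under assumptions: it relies on the third-party PyArrow package for reading the footer, and both function names are hypothetical.

```python
def int96_columns(columns):
    """Given (name, physical_type) pairs, return the names that use INT96."""
    return [name for name, physical in columns if physical == "INT96"]


def columns_from_file(path):
    """Read (name, physical_type) pairs from a file's footer using PyArrow."""
    # Third-party dependency, imported lazily so the helper above works without it.
    import pyarrow.parquet as pq

    parquet_file = pq.ParquetFile(path)
    schema = parquet_file.schema
    return [
        (schema.column(i).path, schema.column(i).physical_type)
        for i in range(parquet_file.metadata.num_columns)
    ]
```

Calling int96_columns(columns_from_file("yourfile.parquet")) would list the columns the warning refers to; an empty list means no INT96 columns were found.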
"No row groups found"
The footer parsed and the schema is valid, but the row group list is empty. A Parquet file with zero row groups is technically valid per the specification but carries no data. This can result from a job that completed with an empty dataset; confirm whether the source data was empty, or whether the job silently failed before writing any rows.
"Unknown compression codec code: [N]"
A column chunk declares a compression codec code that is not in the known Parquet codec registry. This may indicate a very new codec not yet reflected in the validator, or a corrupt column chunk metadata field. Check which version of the Parquet specification your writer targets and whether your query engine supports the codec in question.
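For reference, the codec codes come from the CompressionCodec enum in the Parquet format definition. The lookup below is a sketch of how an unknown code could be flagged; the table reflects the enum as commonly published, and the describe_codec helper is hypothetical.

```python
# Codes from the Parquet format's CompressionCodec enum.
KNOWN_CODECS = {
    0: "UNCOMPRESSED",
    1: "SNAPPY",
    2: "GZIP",
    3: "LZO",
    4: "BROTLI",
    5: "LZ4",
    6: "ZSTD",
    7: "LZ4_RAW",
}


def describe_codec(code):
    """Return the codec name, or a warning string for an unrecognised code."""
    name = KNOWN_CODECS.get(code)
    if name is None:
        return f"Unknown compression codec code: {code}"
    return name
```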
Step 6: Read the File Stats
When a file passes validation (green or yellow status), a teal stats panel appears with key metadata extracted from the footer:
- Total Rows. The sum of row counts across all row groups. This is the total number of records in the file.
- Columns. The number of leaf columns (data columns) in the schema. Nested types count each leaf separately.
- Row Groups. The number of horizontal partitions in the file. More row groups means finer-grained predicate pushdown but more footer overhead. Typical Parquet files for analytical workloads use row groups of 64 MB to 512 MB.
- Compression. The compression codec(s) declared across column chunks. If multiple codecs are used, the most common one is shown.
- Version. The Parquet format version declared in the footer, typically 1 or 2. Format version 2 enables additional features, including the delta encodings (such as DELTA_BINARY_PACKED) and BYTE_STREAM_SPLIT.
- File Size. The total file size in KB or MB.
- Footer Size. The size of the serialised footer in bytes. Large footers (several MB) indicate either a very large schema, many row groups, or extensive key-value metadata.
- KV Metadata. The number of key-value metadata entries in the footer. These are arbitrary string pairs written by the Parquet writer; Arrow schema information, Spark job metadata, and custom application metadata are commonly stored here.
- Created by. The creator string from the footer, identifying the application and version that wrote the file, for example "parquet-cpp-arrow version 14.0.0" or "parquet-mr version 1.12.3".
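Several of these stats are simple aggregates over the row group metadata. The sketch below shows how Total Rows, Row Groups, and the "most common codec" rule for Compression could be derived; the input shape (a list of dicts with "rows" and "codecs" keys) and the function name summarise are hypothetical, chosen only for illustration.

```python
from collections import Counter


def summarise(row_groups):
    """Aggregate per-row-group metadata into file-level stats.

    Each entry is a dict with "rows" (int) and "codecs" (one codec name
    per column chunk in that row group).
    """
    total_rows = sum(rg["rows"] for rg in row_groups)
    codec_counts = Counter(codec for rg in row_groups for codec in rg["codecs"])
    # When several codecs are declared, report the one used by the most column chunks.
    compression = codec_counts.most_common(1)[0][0] if codec_counts else None
    return {
        "total_rows": total_rows,
        "row_groups": len(row_groups),
        "compression": compression,
    }
```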
Step 7: Inspect the Schema
The Schema panel (teal header, labelled "Schema (N columns)") shows the complete list of leaf columns extracted from the footer's schema element tree. Each row in the table shows:
- #. Column index (1-based).
- Column Name. The column's name as declared in the schema. Nested column names are shown using dot notation where applicable.
- Physical Type. The low-level storage type: BOOLEAN, INT32, INT64, INT96, FLOAT, DOUBLE, BYTE_ARRAY, or FIXED_LEN_BYTE_ARRAY. INT96 columns are highlighted in red as a deprecation warning.
- Repetition. REQUIRED (no nulls), OPTIONAL (nullable), or REPEATED (list element).
- Logical Type. The higher-level semantic annotation: STRING, DATE, TIMESTAMP, DECIMAL, LIST, MAP, and others. Columns without a logical type annotation show a dash.
Use the schema panel to verify column names, types, and nullability match your expectations before loading the file into a query engine. A column that should be OPTIONAL (nullable) appearing as REQUIRED, or a timestamp column using INT96 instead of INT64, are common discrepancies worth catching before the file enters production.
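If you keep the expected schema in code, the comparison described above can be automated. The sketch below assumes each schema is a dict mapping column name to a (physical type, repetition, logical type) tuple, mirroring the panel's columns; the function name and the message wording are hypothetical.

```python
def schema_mismatches(expected, actual):
    """Compare two schemas and return a list of human-readable discrepancies.

    Both arguments map column name -> (physical_type, repetition, logical_type).
    """
    problems = []
    for name, want in expected.items():
        got = actual.get(name)
        if got is None:
            problems.append(f"{name}: missing")
        elif got != want:
            problems.append(f"{name}: expected {want}, found {got}")
    for name in actual:
        if name not in expected:
            problems.append(f"{name}: unexpected column")
    return problems
```

An empty result means the file matches the expected definition; anything else is worth resolving before the file reaches a query engine.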
Step 8: Review Row Groups
The Row Groups panel (teal header, labelled "Row Groups (N)") shows the structure of each horizontal partition in the file, up to the first 50 row groups. Each row shows:
- #. Row group index (1-based).
- Rows. The number of records in this row group.
- Compressed Size. The total compressed byte size of all column chunks in this row group.
If the file has more than 50 row groups, a note at the bottom of the panel indicates how many additional row groups exist beyond those shown.
Use this panel to check for row group size imbalances. A file where most row groups contain 100,000 rows but one contains 3 rows is likely the result of a small final batch or a job that restarted and appended a tiny residual partition. Rebalanced row groups improve scan performance for analytical queries by reducing the number of predicate-pushdown decisions the query engine needs to make.
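One way to spot such imbalances programmatically is to compare each row group against the median row count. This is a sketch: the 10% threshold and the function name are arbitrary assumptions, and the returned indices are 1-based to match the panel.

```python
from statistics import median


def undersized_row_groups(row_counts, ratio=0.1):
    """Return (index, rows) for row groups far smaller than the median.

    Indices are 1-based. The default ratio of 0.1 is an arbitrary cut-off.
    """
    if not row_counts:
        return []
    threshold = median(row_counts) * ratio
    return [
        (index, rows)
        for index, rows in enumerate(row_counts, start=1)
        if rows < threshold
    ]
```

For the example above, undersized_row_groups([100000, 100000, 100000, 3]) flags only the fourth row group.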
Worked Examples
Example 1: Spark job output that a downstream job cannot read
Situation: A Spark job writes Parquet output to S3. A downstream job tries to read it and fails with a cryptic error.
What to do: Download one of the output files from S3 and drop it into the Parquet Validator. If the status bar shows red with "Magic bytes missing at end of file", the Spark job failed partway through writing and the footer was never written; check the Spark job logs for an executor failure or OOM error. If the status is green but the downstream job still fails, check the Schema panel: the physical types and logical type annotations may be incompatible with the query engine the downstream job uses (for example, INT96 timestamps in a pipeline that expects INT64).
Example 2: Validating a vendor-supplied data file
Situation: A data vendor sends you a weekly Parquet file. You need to confirm it has the expected schema and row count before loading it into your warehouse.
What to do: Drop the file into the Parquet Validator. Check the Total Rows stat against the row count the vendor reported. Then check the Schema panel: confirm the column names, physical types, and logical types match your warehouse table definition. If a column that should be STRING appears as BYTE_ARRAY without a logical type annotation, or an expected column is missing entirely, flag it with the vendor before loading.
Example 3: Diagnosing a "footer too large" performance complaint
Situation: Engineers report that reading a particular Parquet file is much slower than expected, even when filtering to a small subset of columns.
What to do: Drop the file into the Parquet Validator and check the Footer Size stat. If the footer is several megabytes, the file likely has either a very large number of row groups (common when a file is the result of appending many small batches without compaction) or extensive key-value metadata. Check the Row Groups count โ if it is in the thousands for what should be a single-partition file, the file needs to be compacted into fewer, larger row groups using a Parquet rewriting job.
Example 4: Checking compression compatibility before deployment
Situation: Your pipeline was updated to write Parquet files with ZSTD compression. You need to confirm a new file uses ZSTD before deploying the reader to a production environment that supports it.
What to do: Drop the file into the Parquet Validator and check the Compression stat in the file stats panel. If it shows ZSTD, the writer is working as expected. If it shows SNAPPY or UNCOMPRESSED, the compression configuration was not applied correctly. Keep in mind that when several codecs are in use, the stat shows only the most common one, so a per-column codec override in the writer configuration can leave some columns on a different codec.
Tips and Edge Cases
- Partitioned datasets. Each file in a partitioned Parquet dataset (e.g., a Hive-partitioned directory on S3) is a self-contained Parquet file and can be validated independently. Validate one representative file per partition to spot-check schema consistency across partitions.
- Very large files. Because only the footer is read, files of any size validate quickly: a 10 GB Parquet file with a small footer validates in the same time as a 1 MB file. The file is never fully loaded into browser memory.
- Encrypted Parquet. Parquet files encrypted with the Modular Encryption spec (added in Parquet format version 2) may have an encrypted footer. Files written in encrypted-footer mode use the magic bytes PARE instead of PAR1, so the validator cannot parse them and will report a magic bytes error or a Thrift deserialization error. This is expected behaviour: the validator does not support encrypted files.
- Delta Lake / Iceberg files. Parquet files inside Delta Lake or Iceberg tables are standard Parquet files and can be validated normally. The Delta log or Iceberg manifest is separate from the Parquet file structure itself.
- After fixing a file. If you re-write or recompress a Parquet file based on the validator's output, drop the new version onto the drop zone to re-validate. Click the × button first to clear the previous result.
- Privacy. No data leaves your browser. The validator reads only the footer (not the row data) and processes everything in JavaScript. It is safe to use with production datasets, customer records, or any sensitive columnar data.
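For the partitioned-dataset tip above, choosing one representative file per partition can be scripted once the dataset is available on a local disk. The sketch below is an assumption-laden illustration: it treats every directory that contains Parquet files as a partition and picks the first file by name; the function name is hypothetical.

```python
import os

PARQUET_EXTENSIONS = (".parquet", ".pq", ".parq")


def one_file_per_partition(root):
    """Map each partition directory (relative to root) to one Parquet file in it."""
    picks = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        candidates = sorted(
            name for name in filenames if name.lower().endswith(PARQUET_EXTENSIONS)
        )
        if candidates:
            # Pick the first file by name as the representative for this partition.
            picks[os.path.relpath(dirpath, root)] = os.path.join(dirpath, candidates[0])
    return picks
```

Each returned path can then be dropped into the validator in turn to compare schemas across partitions.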
