The parquet CLI is a powerful tool for inspecting parquet files. I use it a lot for work at Starburst but I have not found much documentation that explains how to understand all the output that the tool provides.
Typically, the information I am interested in gleaning from a parquet file is:
- how many row groups are in the file and what is the size of the row groups
- what encodings are used for each column
- column statistics
- if dictionary encoding was used for a column, whether or not fallback to plain encoding occurred
- column indexes
In this post, I will cover the commands I commonly use and how to read the output that these commands provide to get the information I am interested in.
I am only going to cover the output that would be seen when inspecting parquet
files that were created by version 1.12 or great of the parquet-mr
library.
File metadata
To view the metadata for a parquet file with the parquet CLI, we execute:
parquet-cli meta <path-to-parquet-file>
The output for this will first show you a header with information on the file:
File path: 128_row_group.parquet
Created by: parquet-mr version 1.12.2 (build 77e30c8093386ec52c3cfa6c34b7ef3321322c94)
Properties:
writer.time.zone: GMT
Next, it will show you the schema for the data contained in the parquet file:
Schema:
message hive_schema {
optional int64 orderkey;
optional int64 partkey;
optional int64 suppkey;
optional int32 linenumber;
optional double quantity;
optional double extendedprice;
optional double discount;
optional double tax;
optional binary returnflag (STRING);
optional binary linestatus (STRING);
optional int32 shipdate (DATE);
optional int32 commitdate (DATE);
optional int32 receiptdate (DATE);
optional binary shipinstruct (STRING);
optional binary shipmode (STRING);
optional binary comment (STRING);
}
Then there will be information for each row group in the file:
Row group 0: count: 4710100 28.38 B records start: 4 total(compressed): 127.497 MB total(uncompressed):313.387 MB
--------------------------------------------------------------------------------
type encodings count avg size nulls min / max
orderkey INT64 Z _ R_ F 4710100 0.48 B 0 "4500010215" / "5167807264"
partkey INT64 Z _ 4710100 4.30 B 0 "50" / "199999871"
The header for each row group has the following metadata:
- Row group: tells you which row group this is in the file (row group indices start at 0)
- count: the number of rows in the row group and the size in bytes of each row
- total(compressed): compressed size of the data in the row group
- total(uncompressed): uncompressed size of the data in the row group
In this example, the parquet file was generated with a row group size of 128MB. This matches the compressed size of this row group - 127.497MB.
After the row group header, there will be a row for each column in the row group.
What we are typically most interested in these rows is the encodings
field and
the min / max
field.
The encodings
field will contain a collection of characters. The encodings
field
is to be read by looking at each character in the string individually.
The possible characters you may see are:
- first and second - compression codec (possible options)
- third and fourth - dictionary encoding or not
- fifth, sixth, seventh - encoding of column (can be up to 3? characters)
- eigth - whether or not fallback occurred
For the first character that represents the compression codec, the characters you may see are:
_
- uncompressedS
- snappyG
- gzipL
- LZ0B
- brotli4
- LZ4Z
- ZSTD?
- unknown
For the dictionary encoding you will either see an empty space or a _
character.
For the column encoding, you may see 3 characters:
R
- RLE or plain dictionary encoding_
- plain encodingD
- delta encoding
As an example of how to read the encodings
field, we will look at the orderkey
column
first. The encoding for this column is: Z _ R_ F
. Each character in this string
has a specific meaning:
Z
- the compression codec; in this case ZSTD_
- a leading_
character in this field indicates the column has a dictionary encodingR
- RLE is used for the dictionary encoding_
- some of the data pages for this column are plain encodedF
- this dictionary encoded column fell back to plain encoding because the size of the dictionary became large than the dictionary page size
Next, we will look at the partkey
column encodings: Z _
We know this means the compression codec is ZSTD and this column only has a
plain encoding because of the placement of the _
character.
After the encoding information, it tells us for each column: the number of records, average size, and number of nulls.
Then finally, it tells us one of the most useful pieces of information for each column in the row group: the min/max values for the column in that row group.
This min/max information is very important for letting us know if min/max filtering of row groups will be effective for predicates on a particular column.
In this example, we see that the min/max range for the orderkey
column is
small but the range for the partkey
column is quite large. This tells us
that we can expect min/max filtering to be effective for the orderkey
column
but not as effective for the partkey
column.
Dictionary metadata
For a dictionary encoded column, the parquet-cli
tool has an option to view
the dictionary for that column:
parquet-cli dictionary <path-to-parquet-file> -c <column-name>
This command will output all keys/values for the dictionary in each row group for the specified column.
Using the same file as in our previous example, we can view information
for the dictionaries for the orderkey
column with:
parquet-cli dictionary <path-to-parquet-file> -c orderkey
This will have output like:
Row group 0 dictionary for "orderkey":
0: 4951167618
1: 4951167619
2: 4951167620
3: 4951167621
...
Row group 1 dictionary for "orderkey":
0: 4930598149
1: 4930598150
2: 4930598151
3: 4930598176
4: 4930598177
...
Row group 2 dictionary for "orderkey":
0: 4524740420
1: 4524740421
2: 4524740422
3: 4524740423
4: 4524740448
...
Row group 3 dictionary for "orderkey":
0: 4847377187
1: 4847377188
2: 4847377189
3: 4847377190
4: 4847377191
This parquet file has 4 row groups in it so we can see information for 4
dictionaries. The values for each dictionary are the actual values for the column.
In this case, the column is orderkey
so they values are integers.
But you will see the same with other types. For example, if we look at the
dictionaries for the shipmode
column in this table:
Row group 0 dictionary for "shipmode":
0: "SHIP"
1: "MAIL"
2: "AIR"
3: "REG AIR"
4: "TRUCK"
5: "RAIL"
6: "FOB"
Row group 1 dictionary for "shipmode":
0: "SHIP"
...
You can see the values for the dictionary are the column values. Since this is a very low cardinality column, there are only 7 distinct values.
Individual pages
We can also use the parquet-cli
to view metadata for pages in a
parquet file:
parquet-cli pages <path-to-parquet-file>
This command will show metadata for every page on a per column basis. Each page identifier has a particular format:
<row_group_number>-<page_number>
For a dictionary page, the <page_number>
will be D
.
For example, here is some of the output of the command for the orderkey
column
we have looked at before:
Column: orderkey
--------------------------------------------------------------------------------
page type enc count avg size size rows nulls min / max
0-D dict Z _ 129990 8.00 B 1015.547 kB
0-1 data Z R 20000 1.63 B 31.786 kB
0-2 data Z R 20000 1.75 B 34.228 kB
...
1-D dict Z _ 130108 8.00 B 1016.469 kB
1-1 data Z R 20000 1.63 B 31.786 kB
1-2 data Z R 20000 1.75 B 34.228 kB
...
2-D dict Z _ 130244 8.00 B 1017.531 kB
2-1 data Z R 20000 1.63 B 31.786 kB
2-2 data Z R 20000 1.75 B 34.228 kB
...
3-D dict Z _ 129755 8.00 B 1013.711 kB
3-1 data Z R 20000 1.63 B 31.786 kB
3-2 data Z R 20000 1.75 B 34.228 kB
...
So the majority of the pages are data pages. 1 page in each row group for
this column is a dictionary page as indicated by the type
field and the
prefix in the page identifier.
Next, there is information on the encoding for each page. This can be read in a similar way to the encoding for the row group metadata.
The first character is always the compression codec and the second character is then the encoding of the actual page.
So for example Z _
is a ZSTD compressed page with a plain encoding and Z R
is
a ZSTD compressed page with a RLE dictionary encoding.
Next, for each page it shows you the record count in that page following by the average size of each record. Then it tells you the total size of the page.
Notice that the dictionary pages will always be correspond to the parquet dictionary
size which is controlled when writing by the parquet.dictionary.page.size
property.
The default dictionary page size is 1MB which was used when generating these parquet
files.
Column indexes
Another part of a parquet file that is of interest is the column indexes.
Column indexes track min/max statistics for each data page. This can be leveraged to skip individual data pages for a column in a row group. It is essentially the same as row group min/max filtering but just at a finer granularity.
To view information on column indexes, we issue:
parquet-cli column-index <path-to-parquet-file>
This will output the column indexes for each column per row group. It will show you the min/max and null count for each column index.
Here is some example output for our orderkey
column:
row-group 0:
column index for column orderkey:
Boundary order: UNORDERED
null count min max
page-0 0 4500010215 5142863968
page-1 0 4500013217 5035735234
page-2 0 4502151685 4930893862
...
row-group 1:
column index for column orderkey:
Boundary order: UNORDERED
null count min max
page-0 0 4501938144 5144928130
page-1 0 4501945031 5144938595
page-2 0 4716204005 5144942022
...
row-group 2:
column index for column orderkey:
Boundary order: UNORDERED
null count min max
page-0 0 4524740420 5166273987
page-1 0 4523117509 5059623457
page-2 0 4630680775 5166280994
...
row-group 3:
column index for column orderkey:
Boundary order: UNORDERED
null count min max
page-0 0 4633395622 5169011457
page-1 0 4525823204 5169018245
page-2 0 4525829984 5062322054
This can be useful to know whether or not column indexes will be useful for min/max skipping. In this example, we can see each index has a small range of min/max values so we would expect column indexes to be useful for predicates on this column.
For an example where this is not the case:
column index for column partkey:
Boundary order: UNORDERED
null count min max
page-0 0 8850 199995958
page-1 0 37233 199996089
page-2 0 2830 199982784
page-3 0 19052 199995316
Notice the min/max range for each column index here is quite large so we would not expect column indexes to be very useful for predicates on this column.
Summary
This post was mostly for myself to reference in the future when I need to parse output of the parquet CLI. Hopefully it may prove useful to someone!