Parquet: Difference in Encoding Null Lists vs Empty Lists
In this blog post, we will run a simple experiment to find out how Parquet encodes null lists differently from empty lists. A null list is a list whose value is itself null, while an empty list is a list that exists but has no elements. For example, in Python, a null list would be None, while an empty list would be [].
To perform the experiment, we will use PySpark, the Python API for Apache Spark, a distributed data processing framework that supports reading and writing Parquet files. We will create a Spark DataFrame that contains both a null list and an empty list, and write it to a Parquet file.
The code for creating the DataFrame is as follows:
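from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType

# Reuse the active session if there is one, otherwise start a new one.
spark = SparkSession.builder.getOrCreate()

# One row per case: two regular lists, one empty list, and one null list.
df = spark.createDataFrame(
    [
        ("normal", [1, 2, 3]),
        ("empty", []),
        ("normal", [4, 5, 6]),
        ("null", None),
    ],
    StructType([
        StructField("tag", StringType()),
        StructField("ids", ArrayType(IntegerType(), containsNull=False)),
    ]),
)
df.show()
df.printSchema()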
Console output:
+------+---------+
|   tag|      ids|
+------+---------+
|normal|[1, 2, 3]|
| empty|       []|
|normal|[4, 5, 6]|
|  null|     null|
+------+---------+
root
 |-- tag: string (nullable = true)
 |-- ids: array (nullable = true)
 |    |-- element: integer (containsNull = true)
So far so good.
Next, we write the DataFrame to a Parquet file:
df.write.mode("overwrite").parquet("c:\\tmp\\list_empty_and_null.parquet")
A low-level dump of the resulting file shows the following column metadata:
tag: BINARY SNAPPY DO:4 FPO:52 SZ:83/79/0.95 VC:4 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED
ids:
.list:
..element: INT32 SNAPPY DO:0 FPO:87 SZ:59/60/1.02 VC:8 ENC:PLAIN,RLE
tag TV=4 RL=0 DL=1 DS: 3 DE:PLAIN_DICTIONARY
----------------------------------------------------------------------------
page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN_DICTIONARY ST:[no stats for this column] SZ:10 VC:4
ids.list.element TV=8 RL=1 DL=2
----------------------------------------------------------------------------
page 0: DLE:RLE RLE:RLE VLE:PLAIN ST:[no stats for this column] SZ:37 VC:8
and data:
BINARY tag
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 4 ***
value 1: R:0 D:1 V:normal
value 2: R:0 D:1 V:empty
value 3: R:0 D:1 V:normal
value 4: R:0 D:1 V:null
INT32 ids.list.element
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 8 ***
value 1: R:0 D:2 V:1
value 2: R:1 D:2 V:2
value 3: R:1 D:2 V:3
value 4: R:0 D:1 V:<null>
value 5: R:0 D:2 V:4
value 6: R:1 D:2 V:5
value 7: R:1 D:2 V:6
value 8: R:0 D:0 V:<null>
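To see where the R and D values in this dump come from, here is a minimal pure-Python sketch that flattens the same four rows into (repetition level, definition level, value) triples. It is only an illustration of the level rules for this one column, not Parquet's or Spark's actual writer code. The function name encode_list_column and the constant MAX_DEFINITION_LEVEL are hypothetical names introduced for this example, and the sketch assumes an optional list with required elements, which matches DL=2 in the dump.

```python
from typing import List, Optional, Tuple

# Assumption: `ids` is optional and its elements are required, so the
# maximum definition level of ids.list.element is 2 (DL=2 in the dump).
MAX_DEFINITION_LEVEL = 2

# (repetition level, definition level, value)
Triple = Tuple[int, int, Optional[int]]


def encode_list_column(rows: List[Optional[List[int]]]) -> List[Triple]:
    """Flatten rows of optional integer lists into (R, D, V) triples."""
    triples: List[Triple] = []
    for row in rows:
        if row is None:
            # Null list: nothing along the column path is defined.
            triples.append((0, 0, None))
        elif len(row) == 0:
            # Empty list: the list is defined, but there is no element.
            triples.append((0, 1, None))
        else:
            for index, value in enumerate(row):
                # R=0 starts a new row; R=1 continues the current list.
                repetition = 0 if index == 0 else 1
                triples.append((repetition, MAX_DEFINITION_LEVEL, value))
    return triples


triples = encode_list_column([[1, 2, 3], [], [4, 5, 6], None])
for number, (r, d, v) in enumerate(triples, start=1):
    shown = "<null>" if v is None else v
    print(f"value {number}: R:{r} D:{d} V:{shown}")
```

Run against the four rows from the experiment, the sketch prints the same eight R/D/V lines as the ids.list.element section of the dump.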
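The dump shows that the empty list is encoded as:
value 4: R:0 D:1 V:<null>
while the null list is encoded as:
value 8: R:0 D:0 V:<null>
In both cases the repetition level is 0, but the definition level is 1 for the empty list and 0 for the null list. The maximum definition level of this column is 2 (DL=2 in the dump above), which is the level carried by actual elements. A definition level of 1 therefore means the list itself is defined but contains no elements, while 0 means the list itself is null.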
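The same rule can be checked in the other direction. The following is a minimal pure-Python sketch that rebuilds the rows from the (R, D, V) triples copied from the dump. It is a simplified illustration that only handles this single-level list column, not a general Parquet reader. The function name decode_list_column is a hypothetical name introduced here.

```python
from typing import List, Optional, Tuple

# (repetition level, definition level, value)
Triple = Tuple[int, int, Optional[int]]


def decode_list_column(triples: List[Triple]) -> List[Optional[List[int]]]:
    """Rebuild rows of optional integer lists from (R, D, V) triples."""
    rows: List[Optional[List[int]]] = []
    for repetition, definition, value in triples:
        if repetition == 0:
            # R=0 always starts a new row; D decides what kind of row it is.
            if definition == 0:
                rows.append(None)      # the list itself is null
            elif definition == 1:
                rows.append([])        # the list is defined but empty
            else:
                rows.append([value])   # first element of a non-empty list
        else:
            # R=1 appends to the list started by the previous R=0 triple.
            rows[-1].append(value)
    return rows


# Triples copied from the ids.list.element section of the dump.
dump = [
    (0, 2, 1), (1, 2, 2), (1, 2, 3),
    (0, 1, None),
    (0, 2, 4), (1, 2, 5), (1, 2, 6),
    (0, 0, None),
]
rows = decode_list_column(dump)
print(rows)  # [[1, 2, 3], [], [4, 5, 6], None]
```

The only thing separating [] from None in the rebuilt rows is the definition level, since both triples carry a null value and a repetition level of 0.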