Parquet: Difference in Encoding Null Lists vs Empty Lists

In this blog post, we will run a simple experiment to find out how Parquet encodes null lists differently from empty lists. A null list is a list value that is itself null (missing), while an empty list is a list that exists but has no elements. For example, in Python, a null list would be None, while an empty list would be [].
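As a quick illustration of the distinction on the Python side, here is a minimal sketch. The `describe` helper is hypothetical and exists only for this example. Note that both None and [] are falsy, so a plain `if not ids` check cannot tell them apart.

```python
from typing import List, Optional


def describe(ids: Optional[List[int]]) -> str:
    """Tell a null list, an empty list and a populated list apart."""
    if ids is None:
        # The list itself is missing.
        return "null list"
    if len(ids) == 0:
        # The list exists but holds no elements.
        return "empty list"
    return f"list with {len(ids)} element(s)"


for value in ([1, 2, 3], [], None):
    print(repr(value), "->", describe(value))
```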

To perform the experiment, we will use PySpark, the Python API for Apache Spark, a distributed data processing framework that supports reading and writing Parquet files. We will create a Spark DataFrame that contains both null lists and empty lists, and write it to a Parquet file.

The code for creating the DataFrame is as follows:

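from pyspark.sql import SparkSession
from pyspark.sql.types import (
    ArrayType, IntegerType, StringType, StructField, StructType,
)

# Reuse the active session (e.g. in the pyspark shell) or start a new one.
spark = SparkSession.builder.getOrCreate()

# Four rows: two populated lists, one empty list and one null list.
df = spark.createDataFrame(
    [
        ("normal", [1, 2, 3]),
        ("empty", []),
        ("normal", [4, 5, 6]),
        ("null", None),
    ],
    StructType([
        StructField("tag", StringType()),
        StructField("ids", ArrayType(IntegerType(), containsNull=False)),
    ]),
)
df.show()
df.printSchema()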

Console output:

+------+---------+
|   tag|      ids|
+------+---------+
|normal|[1, 2, 3]|
| empty|       []|
|normal|[4, 5, 6]|
|  null|     null|
+------+---------+

root
 |-- tag: string (nullable = true)
 |-- ids: array (nullable = true)
 |    |-- element: integer (containsNull = true)

So far so good.

Writing the DataFrame to a Parquet file with

df.write.mode("overwrite").parquet("c:\\tmp\\list_empty_and_null.parquet")

and then taking a low-level dump of that file produces the following metadata:

tag:        BINARY SNAPPY DO:4 FPO:52 SZ:83/79/0.95 VC:4 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED
ids:
.list:
..element:  INT32 SNAPPY DO:0 FPO:87 SZ:59/60/1.02 VC:8 ENC:PLAIN,RLE

    tag TV=4 RL=0 DL=1 DS: 3 DE:PLAIN_DICTIONARY
    ----------------------------------------------------------------------------
    page 0:                 DLE:RLE RLE:BIT_PACKED VLE:PLAIN_DICTIONARY ST:[no stats for this column] SZ:10 VC:4

    ids.list.element TV=8 RL=1 DL=2
    ----------------------------------------------------------------------------
    page 0:                 DLE:RLE RLE:RLE VLE:PLAIN ST:[no stats for this column] SZ:37 VC:8

and data:

BINARY tag
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 4 ***
value 1: R:0 D:1 V:normal
value 2: R:0 D:1 V:empty
value 3: R:0 D:1 V:normal
value 4: R:0 D:1 V:null

INT32 ids.list.element
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 8 ***
value 1: R:0 D:2 V:1
value 2: R:1 D:2 V:2
value 3: R:1 D:2 V:3
value 4: R:0 D:1 V:<null>
value 5: R:0 D:2 V:4
value 6: R:1 D:2 V:5
value 7: R:1 D:2 V:6
value 8: R:0 D:0 V:<null>
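The triplets for ids.list.element can be reproduced by hand. The following is a minimal sketch, not Spark's or Parquet's actual writer: `shred_ids` is a hypothetical helper, and it assumes the column is a nullable list of non-null integers with maximum repetition level 1 and maximum definition level 2, as the RL=1 DL=2 line in the metadata suggests.

```python
from typing import List, Optional, Tuple

# (repetition level, definition level, value)
Triplet = Tuple[int, int, Optional[int]]


def shred_ids(rows: List[Optional[List[int]]]) -> List[Triplet]:
    """Encode nullable lists of non-null ints as (R, D, V) triplets.

    Assumes max repetition level 1 and max definition level 2.
    """
    triplets: List[Triplet] = []
    for row in rows:
        if row is None:
            # Null list: nothing on the path is defined.
            triplets.append((0, 0, None))
        elif len(row) == 0:
            # Empty list: the list is defined, but has no element.
            triplets.append((0, 1, None))
        else:
            for i, value in enumerate(row):
                # R is 0 for the first element of a row, 1 for the rest.
                r = 0 if i == 0 else 1
                triplets.append((r, 2, value))
    return triplets


rows = [[1, 2, 3], [], [4, 5, 6], None]
for n, (r, d, v) in enumerate(shred_ids(rows), start=1):
    shown = "<null>" if v is None else v
    print(f"value {n}: R:{r} D:{d} V:{shown}")
```

Under those assumptions, the printed lines match the eight values in the ids.list.element dump.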

The dump shows that the empty list is encoded as:

value 4: R:0 D:1 V:<null>

while the null list is encoded as:

value 8: R:0 D:0 V:<null>

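In both cases the repetition level is 0, because each entry starts a new row. The difference is in the definition level. The column's maximum definition level is 2 (DL=2 in the metadata above), and D:2 marks an actual element. An empty list has D:1, meaning the list itself is defined but contains no elements, while a null list has D:0, meaning the list itself is not defined.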
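Reading the levels in the other direction shows why this encoding is enough to tell the two cases apart. The sketch below is only an illustration under the same assumptions (maximum repetition level 1, maximum definition level 2). `assemble_ids` is a hypothetical helper, not a Parquet library function.

```python
from typing import List, Optional, Tuple

# (repetition level, definition level, value)
Triplet = Tuple[int, int, Optional[int]]


def assemble_ids(triplets: List[Triplet]) -> List[Optional[List[int]]]:
    """Rebuild the rows of the ids column from (R, D, V) triplets."""
    rows: List[Optional[List[int]]] = []
    for r, d, v in triplets:
        if r == 0:
            # R:0 starts a new row; D decides what kind of row it is.
            if d == 0:
                rows.append(None)   # null list
            elif d == 1:
                rows.append([])     # empty list
            else:
                rows.append([v])    # first element of a populated list
        else:
            # R:1 continues the list started by the previous triplet.
            rows[-1].append(v)
    return rows


# The eight values from the ids.list.element dump.
dump = [
    (0, 2, 1), (1, 2, 2), (1, 2, 3),
    (0, 1, None),
    (0, 2, 4), (1, 2, 5), (1, 2, 6),
    (0, 0, None),
]
print(assemble_ids(dump))
```

Feeding in the dumped values gives back the four original rows, with the empty list and the null list kept distinct.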

