Making Pydeequ Work

Getting Pydeequ to work was a bit of a headache because the official instructions are unclear. Here is a cleaner version.

First, Pydeequ works with both Spark 3.2 and 3.3; I haven’t tried other versions. Regardless of the Spark version, you need to install the Deequ Maven package in whatever way you normally install packages on your Spark distribution.

Locally, it looks like this:

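from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("spark.jars.packages", "")  # put the Deequ Maven coordinate for your Spark version here
    .getOrCreate())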

This differs from the official instructions. The following snippet, taken from them, does not work:

from pyspark.sql import SparkSession, Row
import pydeequ

spark = (SparkSession.builder
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())

Those instructions are outdated and simply won’t work; I spent a long time trying to figure that out. In essence, Pydeequ’s Python code tries to read the SPARK_VERSION environment variable and uses it to set pydeequ.deequ_maven_coord, which you are then supposed to pass as the jar coordinates. In practice, this simply doesn’t work.
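
Instead of relying on SPARK_VERSION, you can pick the coordinate yourself. Below is a minimal sketch of that idea. The helper names (deequ_coordinate, build_session) are made up for illustration, and the two coordinates in the table are my assumption of suitable Deequ builds for Spark 3.2 and 3.3, so check them against Maven Central before using them.

```python
# Assumed Deequ coordinates per Spark minor version; verify on Maven Central.
DEEQU_COORDS = {
    "3.2": "com.amazon.deequ:deequ:2.0.1-spark-3.2",
    "3.3": "com.amazon.deequ:deequ:2.0.3-spark-3.3",
}


def deequ_coordinate(spark_version: str) -> str:
    """Return the Deequ Maven coordinate for a Spark version such as '3.3.1'."""
    minor = ".".join(spark_version.split(".")[:2])  # '3.3.1' -> '3.3'
    try:
        return DEEQU_COORDS[minor]
    except KeyError:
        raise ValueError(f"No Deequ coordinate known for Spark {spark_version}") from None


def build_session(spark_version: str):
    """Create a SparkSession that requests the Deequ jar explicitly."""
    # Imported lazily so deequ_coordinate() can be used without Spark installed.
    from pyspark.sql import SparkSession

    return (SparkSession.builder
            .config("spark.jars.packages", deequ_coordinate(spark_version))
            .getOrCreate())
```

With PySpark installed, something like build_session(pyspark.__version__) should then start a session with the matching jar, provided the coordinate really exists for your Spark version.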

You may also need to restart Jupyter and/or the Spark cluster, especially after installing jar packages from within a Spark session. I don’t know why.
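
One plausible explanation, which I have not confirmed, is that getOrCreate() hands back an already running session, so a package setting added afterwards never reaches the JVM that is already up. A small check like the sketch below can tell you whether the running session was started with the coordinate you expect. The function name packages_configured is hypothetical.

```python
def packages_configured(spark, coordinate: str) -> bool:
    """Return True if `coordinate` is listed in the session's spark.jars.packages."""
    # getConf() reflects the configuration the SparkContext was started with.
    configured = spark.sparkContext.getConf().get("spark.jars.packages", "")
    return coordinate in [c.strip() for c in configured.split(",")]
```

If this returns False for the coordinate you just configured, restarting the kernel or cluster before creating the session again is probably the quickest fix.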

In general, the whole setup feels extremely fragile. Once you get it working, freeze the versions of everything related to Spark and Deequ, including your pip requirements, and never touch it again.
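
To notice when something drifts anyway, a tiny version check can help. This is only a sketch: the pinned versions below are hypothetical placeholders, not the ones I used, and the function names are made up for illustration.

```python
from importlib import metadata

# Hypothetical pins; replace with the versions that work in your setup.
PINNED = {"pyspark": "3.3.2", "pydeequ": "1.1.0"}


def installed_versions(names):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def find_mismatches(pinned, installed):
    """Return {name: (expected, actual)} for every package that differs from its pin."""
    return {name: (want, installed.get(name))
            for name, want in pinned.items()
            if installed.get(name) != want}
```

Calling find_mismatches(PINNED, installed_versions(PINNED)) at the top of a notebook returns an empty dict when everything still matches the pins.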

Thanks! You can always email me or use the contact form if you have more questions or comments.