PySpark: Clean Map With UDF

Suppose you have a map column in which many keys have null values (common in Spark calculations), and you want to clean it by removing the keys that have no value. This makes the data easier to read and the payload smaller. With PySpark you can do that with the following UDF:

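from pyspark.sql import functions as f

def map_clean(tags):
    # A null (or empty) map yields null.
    if not tags:
        return None
    # Keep only entries with a truthy value: drops None and also empty strings.
    return {k: v for k, v in tags.items() if v}

# Wrap the function as a UDF that returns map<string, string>.
map_clean_udf = f.udf(map_clean, "map<string, string>")

# df is assumed to be an existing DataFrame with a map<string, string> column "tags".
df.select("tags", map_clean_udf("tags")).show(100, False)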

Both the input and the output are map<string, string>. When run on a sample dataset, the following sample cell:

{ua -> Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:106.0) Gecko/20100101 Firefox/106.0, url -> https://www.aloneguid.uk/posts/2021/11/cpp-ui-frameworks/, referrer -> https://www.bing.com/, out -> null, width -> 1536, height -> 654}

gets transformed to:

{ua -> Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:106.0) Gecko/20100101 Firefox/106.0, url -> https://www.aloneguid.uk/posts/2021/11/cpp-ui-frameworks/, referrer -> https://www.bing.com/, width -> 1536, height -> 654}

The out key was chucked out because its value was null.
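One detail worth checking before relying on the UDF above: its filter tests truthiness, so it drops empty strings as well as nulls, and it turns an empty map into null. The plain-Python sketch below (no Spark needed) shows that behaviour next to a stricter variant. The name map_clean_strict and the sample data are hypothetical, introduced only for illustration.

```python
from typing import Dict, Optional


def map_clean(tags: Optional[Dict[str, Optional[str]]]) -> Optional[Dict[str, str]]:
    """Truthiness-based cleaner, same logic as the UDF body above."""
    if not tags:
        return None
    return {k: v for k, v in tags.items() if v}


def map_clean_strict(tags: Optional[Dict[str, Optional[str]]]) -> Optional[Dict[str, str]]:
    """Hypothetical variant: drops only None values and keeps empty strings."""
    if tags is None:
        return None
    return {k: v for k, v in tags.items() if v is not None}


# Made-up sample: one null value and one empty-string value.
sample = {"referrer": "https://www.bing.com/", "out": None, "note": ""}

print(map_clean(sample))         # drops both "out" and "note"
print(map_clean_strict(sample))  # drops only "out"
print(map_clean({}))             # None, because an empty map is falsy
print(map_clean_strict({}))      # {}, the empty map is kept as-is
```

If empty strings carry meaning in your data, the stricter variant can be wrapped with f.udf in the same way. On Spark 3.1 and later, the built-in map_filter function in pyspark.sql.functions may also be worth a look, since it filters map entries without a Python UDF.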

