PySpark: Clean Map With UDF
Suppose you have a map column with many null values (common in Spark calculations) and you want to clean it up by removing the keys that have no values. This reduces cognitive load when reading the data and makes the payload smaller. With PySpark you can do that with the following UDF:
```python
import pyspark.sql.functions as f

def map_clean(m):
    # Spark passes the map column in as a Python dict (or None).
    if not m:
        return None
    # Keep only the entries that have a non-empty value.
    return {k: v for k, v in m.items() if v}

map_clean_udf = f.udf(map_clean, "map<string, string>")

df.select("tags", map_clean_udf("tags")).show(100, False)
```
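Note that the comprehension filters on truthiness (`if v`), so empty strings are dropped along with nulls; if you want to keep empty strings, use `if v is not None` instead. A quick pure-Python check of the same logic (the sample values are illustrative):

```python
def map_clean(m):
    """Drop entries whose value is falsy (None or empty string)."""
    if not m:
        return None
    return {k: v for k, v in m.items() if v}

# None and "" are both removed; "0" survives because non-empty strings are truthy.
print(map_clean({"a": "1", "b": None, "c": "", "d": "0"}))
# -> {'a': '1', 'd': '0'}
```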
Both the input and the output are of type `map<string, string>`. When running on a sample dataset, the following sample cell:
{ua -> Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:106.0) Gecko/20100101 Firefox/106.0, url -> https://www.aloneguid.uk/posts/2021/11/cpp-ui-frameworks/, referrer -> https://www.bing.com/, out -> null, width -> 1536, height -> 654}
gets transformed to:
{ua -> Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:106.0) Gecko/20100101 Firefox/106.0, url -> https://www.aloneguid.uk/posts/2021/11/cpp-ui-frameworks/, referrer -> https://www.bing.com/, width -> 1536, height -> 654}
The `out` key was chucked out because its value was null.
To contact me, send an email anytime or leave a comment below.