Reading Files Recursively in Spark/Databricks
I’m looking at ways to read JSON files recursively in Databricks, specifically from an abfss:// filesystem. Unlike flat filesystems, a hierarchical one can contain empty folders, meaning patterns like abfss://container@account.dfs.core.windows.net/folder1/y=2021/m=*/d=*/*.json
may unexpectedly fail with something like
pyspark.sql.utils.AnalysisException: Path does not exist: abfss://container@account.dfs.core.windows.net/folder1/y=2021/m=*/d=*/*.json
but the path does exist! The reason for this error is that there may be empty folders somewhere along the path pattern, and Spark doesn’t like that. The same pattern tends to work much better with something like S3 or plain Azure Blob Storage.
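To make the failure concrete, here’s a minimal sketch (container, account and folder names are placeholders) that shows the failing call; if any folder expanded by the glob is empty or missing, the read raises the exception above:

from pyspark.sql.utils import AnalysisException

glob_path = "abfss://container@account.dfs.core.windows.net/folder1/y=2021/m=*/d=*/*.json"
try:
    df = spark.read.json(glob_path)
except AnalysisException as e:
    # Raised when part of the expanded glob points at an empty or missing
    # folder, even though the root path itself exists.
    print(f"glob read failed: {e}")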
Simple Solution
The simplest solution is not to use patterns at all. Replace
df = spark.read.json("abfss://container@account.dfs.core.windows.net/folder1/y=2021/m=*/d=*/*.json")
with
df = spark.read.json("abfss://container@account.dfs.core.windows.net/",
recursiveFileLookup=True)
and this works perfectly well on any filesystem.
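If you prefer the option-style API, the same flag can be passed via .option() (the value is a string here, but it behaves identically):

df = (spark.read
          .option("recursiveFileLookup", "true")
          .json("abfss://container@account.dfs.core.windows.net/"))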
Filtering
The only issue is filtering: if I need to get records only for year 2021 and nothing else, there is no way to do that other than
df = spark.read.json("abfss://container@account.dfs.core.windows.net/folder1/y=2021/",
recursiveFileLookup=True)
and it works perfectly well. But I’m lucky here, as in this case filtering can be done simply by changing the root folder.
If I need, say, to get only the first month across all the years, I’m out of luck. There is a parameter called pathGlobFilter,
but it applies only to the file-name part of the path. Want to filter on the folder path? Give up.
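For completeness, here’s what pathGlobFilter does cover, as a sketch using the same placeholder path: it matches the file-name component only, so you can keep just *.json files while scanning recursively, but a folder-level pattern like y=2021 still can’t be expressed this way:

df = (spark.read
          .option("recursiveFileLookup", "true")
          .option("pathGlobFilter", "*.json")  # applies to file names only
          .json("abfss://container@account.dfs.core.windows.net/folder1/"))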
Em, excuse me! Have Android 📱 and use Databricks?
You might be interested in my totally free (and ad-free) Pocket Bricks. You can get it from Google Play too.
To contact me, send an email anytime or leave a comment below.