Reading Files Recursively in Spark/Databricks
I’m looking at ways to read JSON files recursively in Databricks, specifically from an abfss:// filesystem. Unlike flat filesystems, a hierarchical one can contain empty folders, meaning patterns like abfss://container@account.dfs.core.windows.net/folder1/y=2021/m=*/d=*/*.json
may unexpectedly fail with something like
pyspark.sql.utils.AnalysisException: Path does not exist: abfss://container@account.dfs.core.windows.net/folder1/y=2021/m=*/d=*/*.json
but the path does exist! The reason for this error is that there may be empty folders somewhere along the path pattern, and Spark doesn’t like that. The same pattern tends to work much better with something like S3 or plain Azure Blob Storage.
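To make the failure concrete, here’s a minimal sketch (container, account and folder names are placeholders) that shows the failing call; if any folder expanded by the glob is empty or missing, the read raises the exception above:

from pyspark.sql.utils import AnalysisException

glob_path = "abfss://container@account.dfs.core.windows.net/folder1/y=2021/m=*/d=*/*.json"
try:
    df = spark.read.json(glob_path)
except AnalysisException as e:
    # Raised when part of the expanded glob points at an empty or missing
    # folder, even though the root path itself exists.
    print(f"glob read failed: {e}")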
Simple Solution
The simplest solution is not to use patterns at all. Replace
df = spark.read.json("abfss://container@account.dfs.core.windows.net/folder1/y=2021/m=*/d=*/*.json")
with
df = spark.read.json("abfss://container@account.dfs.core.windows.net/",
recursiveFileLookup=True)
and this works perfectly well on any filesystem.
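If you prefer the option-style API, the same flag can be passed via .option() (the value is a string here, but it behaves identically):

df = (spark.read
          .option("recursiveFileLookup", "true")
          .json("abfss://container@account.dfs.core.windows.net/"))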
Filtering
The only issue is filtering: if I need to get records only for year 2021 and nothing else, there is no way to do that other than
df = spark.read.json("abfss://container@account.dfs.core.windows.net/folder1/y=2021/",
recursiveFileLookup=True)
and it works perfectly well. But I’m lucky here, as in this case filtering can be done simply by changing the root folder.
If I need, say, to get only the first month across all the years, I’m out of luck. There is a parameter called pathGlobFilter,
but it applies only to the file-name part of the path. Want to filter on the folder path? Give up.
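For completeness, here’s what pathGlobFilter does cover, as a sketch using the same placeholder path: it matches the file-name component only, so you can keep just *.json files while scanning recursively, but a folder-level pattern like y=2021 still can’t be expressed this way:

df = (spark.read
          .option("recursiveFileLookup", "true")
          .option("pathGlobFilter", "*.json")  # applies to file names only
          .json("abfss://container@account.dfs.core.windows.net/folder1/"))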
Em, excuse me! Have Android 📱 and use Databricks?
You might be interested in my totally free (and ad-free) Pocket Bricks. You can get it from Google Play too.
To contact me, send an email anytime or leave a comment below.