How to list partition names in a Delta table using Apache Spark
TL;DR
To list partition names in a Delta table using PySpark, follow these steps:

- Import Libraries: Import `DeltaTable` from the Delta Lake library and initialize a Spark session.
- Specify Path: Define the path to your Delta table (e.g., an S3 bucket or local file system).
- Retrieve Details: Use `DeltaTable.forPath` to get a handle to the table and call `.detail()` on it to retrieve table details, which include the partition columns.
Here’s a quick code snippet:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("DeltaLakeExample") \
    .getOrCreate()

path = "s3a://path/to/delta/table"

details = DeltaTable.forPath(spark, path).detail()
details.show()
```

This will print a DataFrame with a column called `partitionColumns`, which contains the partition names.
Full details
When working with Delta Lake tables, you often need to list the partition names for various data management tasks. If all you have is the path to a Delta table, you can still do this easily using PySpark.
Prerequisites
Before we dive into the code, ensure you have the following prerequisites:
- Apache Spark: Make sure you have Apache Spark installed. You can download it from the official website.
- Delta Lake: Ensure you have the Delta Lake library installed. You can enable it in your Spark session using the following configuration:

```python
spark = SparkSession.builder \
    .appName("DeltaLakeExample") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()
```
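If you install Delta Lake from PyPI with `pip install delta-spark`, you can instead use the package's `configure_spark_with_delta_pip` helper, which attaches the matching Delta Lake packages to the session builder for you; a minimal sketch:

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Build the session with the Delta SQL extension and catalog, then let the
# delta-spark helper add the matching Delta Lake Maven packages.
builder = (
    SparkSession.builder.appName("DeltaLakeExample")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()
```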
Step-by-Step Guide
Step 1: Import Required Libraries
First, import the necessary libraries. You’ll need the `DeltaTable` class from the Delta Lake library.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession
```
Step 2: Initialize Spark Session
Next, initialize a Spark session. This session is required to interact with Delta Lake tables.
```python
spark = SparkSession.builder \
    .appName("DeltaLakeExample") \
    .getOrCreate()
```
Step 3: Define the Path to Your Delta Table
Specify the path to your Delta table. This path can be an S3 bucket, a local file system path, or any other supported storage location.
```python
path = "s3a://path/to/delta/table"
```
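If you're not certain the path actually points to a Delta table, you can verify it first with `DeltaTable.isDeltaTable` from the Delta Lake Python API; a small sketch:

```python
# Fail fast if the location is not a Delta table (e.g., plain Parquet files).
if not DeltaTable.isDeltaTable(spark, path):
    raise ValueError(f"Not a Delta table: {path}")
```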
Step 4: Retrieve Table Details
Use the `DeltaTable.forPath` method to get a handle to the Delta table, then call `.detail()` on it. This returns a DataFrame containing various metadata about the table, including its partition columns.
```python
details = DeltaTable.forPath(spark, path).detail()
details.show()
```
The `details` DataFrame has a column called `partitionColumns`, which is an array of strings containing the names of the partition columns.
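If you need the partition names as a plain Python list rather than a displayed DataFrame, you can collect them from the single-row `details` DataFrame; a minimal sketch:

```python
# detail() yields exactly one row, so take it and read the partitionColumns field.
partition_columns = details.select("partitionColumns").first()["partitionColumns"]
print(partition_columns)  # e.g. ['sensorId'] for a table partitioned by sensorId
```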
Example Output
When you run the above code, you should see an output similar to this:
| … | partitionColumns | … |
|---|---|---|
| … | [sensorId] | … |
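Alternatively, you can fetch the same metadata with Delta Lake's `DESCRIBE DETAIL` SQL command; a minimal sketch, assuming the same `spark` session and `path` as above:

```python
# DESCRIBE DETAIL returns the same one-row metadata DataFrame as detail(),
# so the partitionColumns column can be selected from it directly.
spark.sql(f"DESCRIBE DETAIL '{path}'").select("partitionColumns").show()
```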
To contact me, send an email anytime or leave a comment below.