How to list partition names in a Delta table using Apache Spark

TL;DR

To list partition names in a Delta table using PySpark, follow these steps:

  1. Import Libraries: Import DeltaTable from the Delta Lake library and initialize a Spark session.
  2. Specify Path: Define the path to your Delta table (e.g., an S3 bucket or local file system).
  3. Retrieve Details: Use DeltaTable.forPath to load the table, then call detail() to get its metadata, which includes the partition columns.

Here’s a quick code snippet:

from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("DeltaLakeExample") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

path = "s3a://path/to/delta/table"
details = DeltaTable.forPath(spark, path).detail()
details.show()

This prints a DataFrame with a column called partitionColumns, which contains the partition column names.
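
If you want the names as a plain Python list rather than a displayed DataFrame, you can pull the value out of that column. A minimal sketch, assuming details was built as above:

# Extract the partitionColumns array from the one-row detail DataFrame.
partition_cols = details.select("partitionColumns").first()["partitionColumns"]
print(partition_cols)  # e.g. ['sensorId']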

Full details

When working with Delta Lake tables, you often need to list the partition column names for data management tasks. If all you have is the path to a Delta table, you can do this easily with PySpark.

Prerequisites

Before we dive into the code, ensure you have the following prerequisites:

  1. Apache Spark: Make sure you have Apache Spark installed. You can download it from the official website (https://spark.apache.org).

  2. Delta Lake: Ensure you have the Delta Lake library installed. You can enable it in your Spark session using the following configuration (an alternative setup for pip installs is sketched after this list):

    spark = SparkSession.builder \
        .appName("DeltaLakeExample") \
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
        .getOrCreate()
    
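If you installed Delta Lake from PyPI (pip install delta-spark), the package also ships a configure_spark_with_delta_pip helper that pulls in the matching Delta JARs for you. A minimal sketch, assuming the delta-spark package is installed:

from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = SparkSession.builder \
    .appName("DeltaLakeExample") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

# The helper adds the delta-spark package to spark.jars.packages before the
# session is created, so the Delta classes end up on the classpath.
spark = configure_spark_with_delta_pip(builder).getOrCreate()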

Step-by-Step Guide

Step 1: Import Required Libraries

First, import the necessary libraries. You’ll need the DeltaTable class from the Delta Lake library.

from delta.tables import DeltaTable
from pyspark.sql import SparkSession

Step 2: Initialize Spark Session

Next, initialize a Spark session with Delta Lake support enabled, using the same configuration shown in the Prerequisites section. This session is required to interact with Delta Lake tables.

spark = SparkSession.builder \
    .appName("DeltaLakeExample") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

Step 3: Define the Path to Your Delta Table

Specify the path to your Delta table. This path can be an S3 bucket, a local file system path, or any other supported storage location.

path = "s3a://path/to/delta/table"
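
If you just want to try the recipe end to end, you can first create a small partitioned Delta table locally. A minimal sketch; the /tmp/delta/sensors path and the sensorId column are illustrative assumptions, chosen to match the example output below:

# Write a tiny DataFrame partitioned by sensorId as a local Delta table.
df = spark.createDataFrame(
    [(1, 20.5), (2, 21.0), (2, 19.8)],
    ["sensorId", "temperature"],
)
df.write.format("delta").mode("overwrite").partitionBy("sensorId").save("/tmp/delta/sensors")

path = "/tmp/delta/sensors"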

Step 4: Retrieve Table Details

Use the DeltaTable.forPath method to load the Delta table, then call its detail() method. This returns a one-row DataFrame containing metadata about the table, including its partition columns.

details = DeltaTable.forPath(spark, path).detail()
details.show()

The details DataFrame will have a column called partitionColumns, which is a string array containing the names of the partition columns.
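
The same metadata is also available through Delta Lake's DESCRIBE DETAIL SQL command, which is handy if you prefer not to construct a DeltaTable object:

# DESCRIBE DETAIL returns the same one-row metadata DataFrame.
spark.sql(f"DESCRIBE DETAIL '{path}'").select("partitionColumns").show(truncate=False)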

Example Output

When you run the above code, the partitionColumns column of the output looks similar to this (the other metadata columns are omitted here):

+----------------+
|partitionColumns|
+----------------+
|      [sensorId]|
+----------------+


To contact me, send an email anytime or leave a comment below.