Set up a Standalone Scala SBT Application with Delta Lake

Delta Lake is an open-source storage framework that enables building a lakehouse architecture with compute engines like Spark, PrestoDB, Flink, Trino, and Hive. It provides ACID transactions, scalable metadata handling, time travel, and schema enforcement for data stored in cloud or on-premises object stores. In this blog post, we will show you how to set up a standalone Scala SBT application that uses Delta Lake with Spark to create, write, and update Delta tables.

Setting it up is relatively easy; the only issue you might face is that certain Delta Lake releases only work with specific versions of Spark. The compatibility matrix at https://docs.delta.io/latest/releases.html#compatibility-with-apache-spark shows which versions of Delta Lake work with which versions of Spark:

Delta Lake version     Apache Spark version
1.1                    3.2.x
1.0.x                  3.1.x
0.7.x and 0.8.x        3.0.x
Below 0.7.0            2.4.2 - 2.4.<latest>

Sample working build.sbt

name := "myproject"

version := "0.1"

scalaVersion := "2.12.15"
val sparkVersion = "3.2.0"

libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion

libraryDependencies += "io.delta" %% "delta-core" % "1.1.0"
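
If you launch the application directly with sbt run (rather than packaging it for spark-submit), it can also help to run it in a forked JVM so that Spark's shutdown does not interfere with the sbt shell. This is an optional addition to the build.sbt above, not something Delta Lake requires:

// optional: run the application in a JVM separate from sbt itself
fork := true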

And minimal code:

package uk.aloneguid.myproject

import org.apache.spark.sql.{Row, SaveMode, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import io.delta.tables._

object Delta {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("delta-test") // Spark requires an application name to be set
      .master("local[1]")
      // enable Delta's SQL extension and catalog so the DeltaTable API and Delta SQL work
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      .getOrCreate()

    // a tiny in-memory dataset to write out as a Delta table
    val data = Seq(Row(1, "Aloneguid"), Row(2, "Blogging"))

    val schema = StructType(Seq(
      StructField("id", IntegerType),
      StructField("subject", StringType)))

    val df = spark
      .createDataFrame(
        spark.sparkContext.parallelize(data),
        schema)

    // write the DataFrame out as a Delta table; adjust the output path for your environment
    df.write.format("delta").mode(SaveMode.Overwrite).save("c:/tmp/delta.test")

    df.show()
  }
}
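
The io.delta.tables._ import becomes useful once you want to read the table back or modify it in place. As a rough sketch, not part of the minimal example above, the following could be appended to the same main method (it assumes the same output path): reading the table as a plain DataFrame, updating a row through the DeltaTable API, and time-travelling back to the first version.

    // read the table back as a regular DataFrame
    val readBack = spark.read.format("delta").load("c:/tmp/delta.test")
    readBack.show()

    // update a row in place via the DeltaTable API
    import org.apache.spark.sql.functions.{col, lit}
    val table = DeltaTable.forPath(spark, "c:/tmp/delta.test")
    table.update(col("id") === 1, Map("subject" -> lit("Updated")))

    // time travel: read the table as it looked at version 0
    val firstVersion = spark.read.format("delta")
      .option("versionAsOf", 0)
      .load("c:/tmp/delta.test")
    firstVersion.show()

The two .config(...) lines on the SparkSession builder are what the Delta documentation recommends for this kind of DeltaTable and SQL usage.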


To contact me, send an email anytime or leave a comment below.