Setting up Scala for Spark Development
This post is a short, reference-style article on setting up a Scala environment for Apache Spark development.
Simplest Thing
Your build.sbt should look like:
name := "My Project"
version := "0.1"
scalaVersion := "2.11.12"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.6"
Your Entry.scala:
import org.apache.log4j.{Level, LogManager}
import org.apache.spark.sql.SparkSession

object Entry {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .master("local")
      .getOrCreate()
    LogManager.getRootLogger.setLevel(Level.ERROR)
    // use the spark variable here to write your programs
  }
}
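To check the setup end to end, a small word-count program can replace that comment. This is a minimal sketch; the WordCount object name, the sample sentences, and the aggregation are purely illustrative:

import org.apache.log4j.{Level, LogManager}
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .master("local")
      .getOrCreate()
    LogManager.getRootLogger.setLevel(Level.ERROR)
    import spark.implicits._

    // count words in a tiny in-memory dataset, purely as a smoke test
    val lines = Seq("spark makes this easy", "sbt makes spark builds easy").toDS()
    lines
      .flatMap(_.split("\\s+"))
      .groupBy("value")
      .count()
      .show()

    spark.stop()
  }
}

Either version can be started locally with sbt run from the project root.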
Integrating with Azure Pipelines
Azure Pipelines has built-in support for sbt, so you can build and package with the following task (in its simplest form):
- task: CmdLine@2
  displayName: "sbt"
  inputs:
    script: |
      sbt clean
      sbt update
      sbt compile
      sbt package
    workingDirectory: 'project-dir'
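The four sbt invocations can also be collapsed into a single sbt clean update compile package command in the script, which avoids starting a fresh JVM for each step.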
To pass a version number, you can use a variable from your pipeline. Say it’s called projectVersion; the pipeline task then becomes:
- task: CmdLine@2
  displayName: "sbt"
  inputs:
    script: |
      sbt clean
      sbt update
      sbt compile
      sbt package
    workingDirectory: 'project-dir'
  env:
    v: $(projectVersion)
This merely creates an environment variable called v for the sbt task. To pick it up, modify the version line in build.sbt:
version := sys.env.getOrElse("v", "0.1")
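The packaged artifact then carries the pipeline version in its file name; with the build.sbt above, sbt package writes something like target/scala-2.11/my-project_2.11-<version>.jar.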
You can also create an uber JAR; however, uber JARs are relatively large (a 70 KB package easily grows to over 100 MB), so I’d try to avoid them.
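If you do need one, the sbt-assembly plugin is the usual route. A minimal sketch follows; the plugin version is an assumption (check for the current release), and Spark is marked provided so the cluster's own jars are not re-bundled:

// project/plugins.sbt  (plugin version is an assumption, pick the current release)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")

// build.sbt additions
// keep Spark out of the uber JAR; it is already available on the cluster
// (note: with "provided", sbt run no longer has Spark on the classpath)
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.6" % "provided"

// resolve duplicate META-INF entries that would otherwise fail the merge
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _                             => MergeStrategy.first
}

Running sbt assembly then produces the uber JAR under target/scala-2.11/.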
