Setting up Scala for Spark Development

This post is a quick reference for setting up a Scala environment for Apache Spark development.

Simplest Thing

Your build.sbt should look like this (Spark 2.4.x is published for Scala 2.11 and 2.12):

name := "My Project"

version := "0.1"

scalaVersion := "2.11.12"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.6"

Your Entry.scala:

import org.apache.log4j.{Level, LogManager}
import org.apache.spark.sql.SparkSession

object Entry {
  def main(args: Array[String]): Unit = {

    val spark = SparkSession
      .builder()
      .master("local")
      .getOrCreate()
    LogManager.getRootLogger.setLevel(Level.ERROR)

    // use the spark variable here to write your programs

    spark.stop()
  }
}
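As a quick illustration of what goes in place of that comment, here is the same entry point filled in with a toy job: it builds a small local Dataset and counts it. The sample values are made up purely for the example.

```scala
import org.apache.log4j.{Level, LogManager}
import org.apache.spark.sql.SparkSession

object Entry {
  def main(args: Array[String]): Unit = {

    val spark = SparkSession
      .builder()
      .master("local")
      .getOrCreate()
    LogManager.getRootLogger.setLevel(Level.ERROR)

    // needed for .toDS() on local collections
    import spark.implicits._

    // toy data, purely illustrative
    val words = Seq("spark", "scala", "sbt").toDS()
    println(words.count()) // prints 3

    spark.stop()
  }
}
```

Run it with sbt run from the project directory; the only console output (with the root logger at ERROR) should be the count.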

Integrating with Azure Pipelines

sbt comes pre-installed on Microsoft-hosted Azure Pipelines agents, so you can build and package with a plain command-line task (simplest version):

- task: CmdLine@2
  displayName: "sbt"
  inputs:
    script: |
      sbt clean
      sbt update
      sbt compile
      sbt package
    workingDirectory: 'project-dir'

To pass a version number, you can use a variable from your pipeline. Say it’s called projectVersion; the pipeline task then becomes:

- task: CmdLine@2
  displayName: "sbt"
  inputs:
    script: |
      sbt clean
      sbt update
      sbt compile
      sbt package
    workingDirectory: 'project-dir'
  env:
    v: $(projectVersion)

which merely creates an environment variable called v for the sbt task. To pick it up, modify the version line in build.sbt so it reads the variable and falls back to a default:

version := sys.env.getOrElse("v", "0.1")

You can also create an uber JAR, but they are relatively large (a 70 KB application JAR can grow to over 100 MB once dependencies are bundled), so I’d try to avoid it.
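If you do need an uber JAR anyway, the sbt-assembly plugin is the usual route. A minimal sketch (the plugin version here is an assumption; check for the current release):

```scala
// project/plugins.sbt
// version is illustrative, not pinned to this article
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.15.0")
```

Marking the Spark dependency as "provided" in build.sbt keeps Spark itself out of the assembled JAR, since the cluster already supplies it:

```scala
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.6" % "provided"
```

Then build with sbt assembly instead of sbt package.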

Thanks! You can always email me or use the contact form for questions, comments, etc.