Setting up Scala for Spark Development
This post is a short, reference-style article on setting up a Scala environment for Apache Spark development.
Simplest Thing
Your build.sbt should look like:
name := "My Project"
version := "0.1"
scalaVersion := "2.11.12"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.6"
Your Entry.scala:
import org.apache.log4j.{Level, LogManager}
import org.apache.spark.sql.SparkSession

object Entry {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .master("local")
      .getOrCreate()
    LogManager.getRootLogger.setLevel(Level.ERROR)
    // use the spark variable here to write your programs
  }
}
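To check the setup end to end, a small word-count program can replace that comment. This is a minimal sketch; the WordCount object name, the sample sentences, and the aggregation are purely illustrative:

import org.apache.log4j.{Level, LogManager}
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .master("local")
      .getOrCreate()
    LogManager.getRootLogger.setLevel(Level.ERROR)
    import spark.implicits._

    // count words in a tiny in-memory dataset, purely as a smoke test
    val lines = Seq("spark makes this easy", "sbt makes spark builds easy").toDS()
    lines
      .flatMap(_.split("\\s+"))
      .groupBy("value")
      .count()
      .show()

    spark.stop()
  }
}

Either version can be started locally with sbt run from the project root.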
Integrating with Azure Pipelines
Azure Pipelines has built-in support for sbt, so you can build and package with the following task (in its simplest form):
- task: CmdLine@2
  displayName: "sbt"
  inputs:
    script: |
      sbt clean
      sbt update
      sbt compile
      sbt package
    workingDirectory: 'project-dir'
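The four sbt invocations can also be collapsed into a single sbt clean update compile package command in the script, which avoids starting a fresh JVM for each step.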
To pass a version number, you can use a variable from your pipeline. Say it’s called projectVersion; the pipeline task then becomes:
- task: CmdLine@2
  displayName: "sbt"
  inputs:
    script: |
      sbt clean
      sbt update
      sbt compile
      sbt package
    workingDirectory: 'project-dir'
  env:
    v: $(projectVersion)
This merely creates an environment variable called v for the sbt task. To pick it up, modify the version line in build.sbt:
version := sys.env.getOrElse("v", "0.1")
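The packaged artifact then carries the pipeline version in its file name; with the build.sbt above, sbt package writes something like target/scala-2.11/my-project_2.11-<version>.jar.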
You can also create an uber JAR; however, uber JARs are relatively large (a 70 KB package easily grows to over 100 MB), so I’d try to avoid them.
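If you do need one, the sbt-assembly plugin is the usual route. A minimal sketch follows; the plugin version is an assumption (check for the current release), and Spark is marked provided so the cluster's own jars are not re-bundled:

// project/plugins.sbt  (plugin version is an assumption, pick the current release)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")

// build.sbt additions
// keep Spark out of the uber JAR; it is already available on the cluster
// (note: with "provided", sbt run no longer has Spark on the classpath)
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.6" % "provided"

// resolve duplicate META-INF entries that would otherwise fail the merge
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _                             => MergeStrategy.first
}

Running sbt assembly then produces the uber JAR under target/scala-2.11/.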
