Automating Databricks with Bash
This is a collection of the most common bash scripts for automating Databricks.
All the scenarios require the Databricks CLI to be installed and configured. The examples also make heavy use of jq, which is available in virtually every Linux distro's package repositories.
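If you are starting from scratch, here is a minimal setup sketch (this uses the legacy pip-installable CLI; adjust to your environment):
#!/bin/bash
# install the Databricks CLI (requires Python and pip)
pip install databricks-cli
# one-time interactive configuration: prompts for workspace URL and token
databricks configure --token
# verify jq is available; install it via your package manager if not
jq --version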
Create or Update a Cluster Instance Pool
Input:
- POOL_NAME env var.
- CONFIG_PATH env var.
Using Instance Pools CLI.
#!/bin/bash
# look up the pool ID by name; export it so envsubst can substitute it below
export POOL_ID=$(databricks instance-pools list --output JSON | jq -r --arg I "$POOL_NAME" '.instance_pools[] | select(.instance_pool_name == $I) | .instance_pool_id')
if [[ -z "$POOL_ID" ]]; then
    echo "creating pool"
    databricks instance-pools create --json-file "$CONFIG_PATH"
else
    echo "pool already exists, issuing edit on pool $POOL_ID"
    # substitute $POOL_ID into the config, since edit needs instance_pool_id
    envsubst < "$CONFIG_PATH" > ./tmp.json
    cat ./tmp.json
    databricks instance-pools edit --json-file ./tmp.json
fi
The pool configuration file looks like this (note the $POOL_ID placeholder that envsubst replaces):
{
    "instance_pool_name": "General",
    "instance_pool_id": "$POOL_ID",
    "min_idle_instances": 1,
    "max_capacity": 10,
    "node_type_id": "Standard_DS3_v2",
    "idle_instance_autotermination_minutes": 60,
    "enable_elastic_disk": true,
    "preloaded_spark_versions": [
        "7.3.x-scala2.12"
      ],
    "azure_attributes": {
      "availability": "ON_DEMAND_AZURE"
    }
}
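With the script saved as, say, create_pool.sh (an illustrative name) and the config above as pool.json, a typical invocation looks like:
POOL_NAME="General" CONFIG_PATH=./pool.json ./create_pool.sh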
Create or Update a Cluster
Attaching to a pool is supported!
Input:
- CLUSTER_NAME env var.
- CONFIG_PATH env var.
- POOL_NAME env var.
Using Clusters CLI.
#!/bin/bash
# resolve cluster and pool IDs by name; export both so envsubst can substitute them
export CLUSTER_ID=$(databricks clusters list --output JSON | jq -r --arg I "$CLUSTER_NAME" '.clusters[] | select(.cluster_name == $I) | .cluster_id')
export POOL_ID=$(databricks instance-pools list --output JSON | jq -r --arg I "$POOL_NAME" '.instance_pools[] | select(.instance_pool_name == $I) | .instance_pool_id')
envsubst < "$CONFIG_PATH" > ./tmp.json
cat ./tmp.json
if [[ -z "$CLUSTER_ID" ]]; then
    echo "creating new cluster"
    databricks clusters create --json-file ./tmp.json
else
    echo "cluster already exists, issuing edit on cluster $CLUSTER_ID"
    databricks clusters edit --json-file ./tmp.json
fi
Sample cluster config:
{
    "cluster_name": "$CLUSTER_NAME",
    "cluster_id": "$CLUSTER_ID",
    "spark_version": "7.3.x-scala2.12",
    "autoscale": {
        "min_workers": 1,
        "max_workers": 4
    },
    "num_workers": 1,
    "spark_conf": {
        "spark.databricks.delta.preview.enabled": "true"
    },
    "instance_pool_id": "$POOL_ID"
}
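Likewise, assuming the script is saved as create_cluster.sh (an illustrative name):
CLUSTER_NAME="my-cluster" POOL_NAME="General" CONFIG_PATH=./cluster.json ./create_cluster.sh
One caveat: the sample sets both num_workers and autoscale; the Clusters API treats these as alternatives, so you would normally keep only one of them.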
Create or Update a Job by Name
Input:
- JOB_NAME set as an environment variable.
- job.json is a local file describing the job.
Using Jobs CLI.
#!/bin/bash
# check if the job exists (by name) and get its ID
JOB_ID=$(databricks jobs list --output JSON | jq -r --arg I "$JOB_NAME" '.jobs[] | select(.settings.name == $I) | .job_id')
if [[ -z "$JOB_ID" ]]; then
  echo "creating a new job"
  databricks jobs create --json-file job.json
else
  echo "updating job $JOB_ID"
  # reset overwrites all of the job's settings with the contents of job.json
  databricks jobs reset --job-id "$JOB_ID" --json-file job.json
fi
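The script reads job.json as-is (no envsubst here), so the name field must match the JOB_NAME env var for the lookup above to find it. A minimal sketch in Jobs API 2.0 format, with placeholder notebook path and cluster settings:
{
    "name": "my-job",
    "new_cluster": {
        "spark_version": "7.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 1
    },
    "notebook_task": {
        "notebook_path": "/Shared/my-notebook"
    }
}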
Terminate All Job Runs and Start Again
Input:
- JOB_ID env var.
#!/bin/bash
# stop all active job runs
until [[ -z $(databricks runs list --job-id "$JOB_ID" --active-only --output JSON | jq '.runs | .[]? | .run_id') ]] ;
do
  echo "job is still running...."
  # cancel the first active run; the loop repeats until none are left
  RUN_ID=$(databricks runs list --job-id "$JOB_ID" --active-only --output JSON | jq '.runs | .[]? | .run_id' | head -n 1)
  echo "cancelling run '$RUN_ID'"
  databricks runs cancel --run-id "$RUN_ID" > /dev/null
  sleep 5s
done
# start the job again
echo "starting job $JOB_ID"
databricks jobs run-now --job-id "$JOB_ID"
Note the jq .[]? syntax, which safely handles a null or empty runs array when the job has no active runs.
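If the pipeline also needs to wait for that new run to finish, here is a small polling sketch (the terminal states follow the Runs API life_cycle_state values; capturing run_id from run-now's JSON output is my addition):
#!/bin/bash
# capture the run_id that run-now returns
RUN_ID=$(databricks jobs run-now --job-id "$JOB_ID" | jq -r '.run_id')
# poll until the run reaches a terminal life cycle state
while true; do
  STATE=$(databricks runs get --run-id "$RUN_ID" | jq -r '.state.life_cycle_state')
  echo "run $RUN_ID is $STATE"
  [[ "$STATE" == "TERMINATED" || "$STATE" == "SKIPPED" || "$STATE" == "INTERNAL_ERROR" ]] && break
  sleep 15s
done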
Tips
The easiest way to set up the CLI (especially in a CI/CD environment) is to set two environment variables: DATABRICKS_HOST and DATABRICKS_TOKEN.
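For example, in a CI/CD step (both values below are placeholders; keep the token in your pipeline's secret store):
export DATABRICKS_HOST="https://adb-0000000000000000.0.azuredatabricks.net"
export DATABRICKS_TOKEN="dapiXXXXXXXXXXXXXXXX"
databricks clusters list   # any CLI command now authenticates via these variables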
Em, excuse me! Have Android 📱 and use Databricks?
You might be interested in my totally free (and ad-free) Pocket Bricks. You can get it from Google Play too.

To contact me, send an email anytime or leave a comment below.
