Automating Databricks with Bash
This is a collection of the most common bash scripts for automating Databricks.
All the scenarios require the Databricks CLI to be installed and configured. The examples also make heavy use of jq, which is available in virtually every Linux distro's package repositories.
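If you are starting from scratch, here is a minimal setup sketch (this uses the legacy pip-installable CLI; adjust to your environment):
#!/bin/bash
# install the Databricks CLI (requires Python and pip)
pip install databricks-cli
# one-time interactive configuration: prompts for workspace URL and token
databricks configure --token
# verify jq is available; install it via your package manager if not
jq --version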
Create or Update a Cluster Instance Pool
Input:
- POOL_NAME env var.
- CONFIG_PATH env var.
Using Instance Pools CLI.
#!/bin/bash
# look up the pool ID by name; export it so envsubst can substitute it below
export POOL_ID=$(databricks instance-pools list --output JSON | jq -r --arg I "$POOL_NAME" '.instance_pools[] | select(.instance_pool_name == $I) | .instance_pool_id')
if [[ -z "$POOL_ID" ]]; then
    echo "creating pool"
    databricks instance-pools create --json-file "$CONFIG_PATH"
else
    echo "pool already exists, issuing edit on pool $POOL_ID"
    # substitute $POOL_ID into the config, since edit needs instance_pool_id
    envsubst < "$CONFIG_PATH" > ./tmp.json
    cat ./tmp.json
    databricks instance-pools edit --json-file ./tmp.json
fi
The pool configuration file looks like this (note the $POOL_ID placeholder that envsubst replaces):
{
    "instance_pool_name": "General",
    "instance_pool_id": "$POOL_ID",
    "min_idle_instances": 1,
    "max_capacity": 10,
    "node_type_id": "Standard_DS3_v2",
    "idle_instance_autotermination_minutes": 60,
    "enable_elastic_disk": true,
    "preloaded_spark_versions": [
        "7.3.x-scala2.12"
      ],
    "azure_attributes": {
      "availability": "ON_DEMAND_AZURE"
    }
}
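With the script saved as, say, create_pool.sh (an illustrative name) and the config above as pool.json, a typical invocation looks like:
POOL_NAME="General" CONFIG_PATH=./pool.json ./create_pool.sh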
Create or Update a Cluster
Attaching to a pool is supported!
Input:
- CLUSTER_NAME env var.
- CONFIG_PATH env var.
- POOL_NAME env var.
Using Clusters CLI.
#!/bin/bash
# resolve cluster and pool IDs by name; export both so envsubst can substitute them
export CLUSTER_ID=$(databricks clusters list --output JSON | jq -r --arg I "$CLUSTER_NAME" '.clusters[] | select(.cluster_name == $I) | .cluster_id')
export POOL_ID=$(databricks instance-pools list --output JSON | jq -r --arg I "$POOL_NAME" '.instance_pools[] | select(.instance_pool_name == $I) | .instance_pool_id')
envsubst < "$CONFIG_PATH" > ./tmp.json
cat ./tmp.json
if [[ -z "$CLUSTER_ID" ]]; then
    echo "creating new cluster"
    databricks clusters create --json-file ./tmp.json
else
    echo "cluster already exists, issuing edit on cluster $CLUSTER_ID"
    databricks clusters edit --json-file ./tmp.json
fi
Sample cluster config:
{
    "cluster_name": "$CLUSTER_NAME",
    "cluster_id": "$CLUSTER_ID",
    "spark_version": "7.3.x-scala2.12",
    "autoscale": {
        "min_workers": 1,
        "max_workers": 4
    },
    "num_workers": 1,
    "spark_conf": {
        "spark.databricks.delta.preview.enabled": "true"
    },
    "instance_pool_id": "$POOL_ID"
}
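Likewise, assuming the script is saved as create_cluster.sh (an illustrative name):
CLUSTER_NAME="my-cluster" POOL_NAME="General" CONFIG_PATH=./cluster.json ./create_cluster.sh
One caveat: the sample sets both num_workers and autoscale; the Clusters API treats these as alternatives, so you would normally keep only one of them.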
Create or Update a Job by Name
Input:
- JOB_NAME set as an environment variable.
- job.json is a local file describing the job.
Using Jobs CLI.
#!/bin/bash
# check if the job exists (by name) and get its ID
JOB_ID=$(databricks jobs list --output JSON | jq -r --arg I "$JOB_NAME" '.jobs[] | select(.settings.name == $I) | .job_id')
if [[ -z "$JOB_ID" ]]; then
  echo "creating a new job"
  databricks jobs create --json-file job.json
else
  echo "updating job $JOB_ID"
  # reset overwrites all of the job's settings with the contents of job.json
  databricks jobs reset --job-id "$JOB_ID" --json-file job.json
fi
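The script reads job.json as-is (no envsubst here), so the name field must match the JOB_NAME env var for the lookup above to find it. A minimal sketch in Jobs API 2.0 format, with placeholder notebook path and cluster settings:
{
    "name": "my-job",
    "new_cluster": {
        "spark_version": "7.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 1
    },
    "notebook_task": {
        "notebook_path": "/Shared/my-notebook"
    }
}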
Terminate All Job Runs and Start Again
Input:
- JOB_ID env var.
#!/bin/bash
# stop all active job runs
until [[ -z $(databricks runs list --job-id "$JOB_ID" --active-only --output JSON | jq '.runs | .[]? | .run_id') ]] ;
do
  echo "job is still running...."
  # cancel the first active run; the loop repeats until none are left
  RUN_ID=$(databricks runs list --job-id "$JOB_ID" --active-only --output JSON | jq '.runs | .[]? | .run_id' | head -n 1)
  echo "cancelling run '$RUN_ID'"
  databricks runs cancel --run-id "$RUN_ID" > /dev/null
  sleep 5s
done
# start the job again
echo "starting job $JOB_ID"
databricks jobs run-now --job-id "$JOB_ID"
Note the jq .[]? syntax, which safely handles a null or empty runs array when the job has no active runs.
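If the pipeline also needs to wait for that new run to finish, here is a small polling sketch (the terminal states follow the Runs API life_cycle_state values; capturing run_id from run-now's JSON output is my addition):
#!/bin/bash
# capture the run_id that run-now returns
RUN_ID=$(databricks jobs run-now --job-id "$JOB_ID" | jq -r '.run_id')
# poll until the run reaches a terminal life cycle state
while true; do
  STATE=$(databricks runs get --run-id "$RUN_ID" | jq -r '.state.life_cycle_state')
  echo "run $RUN_ID is $STATE"
  [[ "$STATE" == "TERMINATED" || "$STATE" == "SKIPPED" || "$STATE" == "INTERNAL_ERROR" ]] && break
  sleep 15s
done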
Tips
The easiest way to set up the CLI (especially in a CI/CD environment) is to set two environment variables: DATABRICKS_HOST and DATABRICKS_TOKEN.
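For example, in a CI/CD step (both values below are placeholders; keep the token in your pipeline's secret store):
export DATABRICKS_HOST="https://adb-0000000000000000.0.azuredatabricks.net"
export DATABRICKS_TOKEN="dapiXXXXXXXXXXXXXXXX"
databricks clusters list   # any CLI command now authenticates via these variables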
Em, excuse me! Have Android 📱 and use Databricks?
You might be interested in my totally free (and ad-free) Pocket Bricks. You can get it from Google Play too.

To contact me, send an email anytime or leave a comment below.
