Connect from Spark to AWS S3 via Assume Role credentials

This is a non-trivial one and it cost me a day of headache. I'm using hadoop-aws 3.2.0, and the short answer is:

val hc = spark.sparkContext.hadoopConfiguration
hc.set("fs.s3a.access.key", config.awsKey)
hc.set("fs.s3a.secret.key", config.awsSecret)
hc.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider")
hc.set("fs.s3a.assumed.role.arn", "arn:aws:iam::...:role/...")
hc.set("fs.s3a.assumed.role.sts.endpoint.region", config.awsRegion)
System.setProperty("aws.region", config.awsRegion) // a hack: the AWS SDK reads the region from this system property (see below)
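Note that System.setProperty only affects the JVM it runs in, i.e. the driver. If your S3 reads happen on executors, a way to set the same property everywhere (sketched here for spark-submit; the region value is a placeholder) is to pass it as a JVM option:

```shell
# Set aws.region as a JVM system property on both driver and executors,
# so the AWS SDK's region provider chain can resolve it in every JVM.
spark-submit \
  --conf "spark.driver.extraJavaOptions=-Daws.region=eu-west-1" \
  --conf "spark.executor.extraJavaOptions=-Daws.region=eu-west-1" \
  your-app.jar
```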

Long Answer

Assume role support is only available since hadoop-aws 3.x (Spark 3 already ships with it, but if you're running Spark standalone, make sure your build does too). You enable it by setting the fs.s3a.assumed.role.arn property and explicitly selecting org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider as your credential provider. None of this exists in hadoop-aws 2.x, but if you really need it there, you can implement a workaround yourself. One more thing worth mentioning: sometimes you will get the following exception even when doing it the official way:

Exception in thread "main" org.apache.hadoop.fs.s3a.AWSClientIOException: Instantiate org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider on : com.amazonaws.SdkClientException: Unable to find a region via the region provider chain. Must provide an explicit region in the builder or setup environment to supply a region.: Unable to find a region via the region provider chain. Must provide an explicit region in the builder or setup environment to supply a region.
	at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:189)
	at org.apache.hadoop.fs.s3a.S3AUtils.createAWSCredentialProvider(S3AUtils.java:713)
	at org.apache.hadoop.fs.s3a.S3AUtils.createAWSCredentialProviderSet(S3AUtils.java:605)

regardless of whether fs.s3a.assumed.role.sts.endpoint.region is set. Digging deep into the AWS SDK, I found that the assume-role provider actually resolves the region through the SDK's standard region provider chain, which Spark does not populate. Luckily, one link in that chain is the aws.region system property, which is easy to set, hence the System.setProperty call above. I hope this gets fixed at some point.
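For completeness, the hadoop-aws 2.x workaround mentioned earlier can be a custom credentials provider that calls STS through the AWS SDK v1 directly. A minimal sketch (the class and session names are mine, and it assumes the aws-java-sdk-sts artifact is on the classpath; not tested against a real cluster):

```scala
import java.net.URI
import com.amazonaws.auth.{AWSCredentials, AWSCredentialsProvider, STSAssumeRoleSessionCredentialsProvider}
import org.apache.hadoop.conf.Configuration

// Custom provider for hadoop-aws 2.x, which lacks AssumedRoleCredentialProvider.
// S3A instantiates it via the (URI, Configuration) constructor; the heavy
// lifting (calling STS, refreshing temporary credentials) is delegated to the
// SDK's own assume-role provider.
class StsAssumeRoleProvider(uri: URI, conf: Configuration) extends AWSCredentialsProvider {
  private val delegate =
    new STSAssumeRoleSessionCredentialsProvider.Builder(
      conf.get("fs.s3a.assumed.role.arn"), // reuse the v3 property name for the role ARN
      "spark-s3a-session"                  // arbitrary session name
    ).build()

  override def getCredentials: AWSCredentials = delegate.getCredentials
  override def refresh(): Unit = delegate.refresh()
}
```

You would then select it with hc.set("fs.s3a.aws.credentials.provider", classOf[StsAssumeRoleProvider].getName) instead of the built-in provider.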

Have a question⁉ Contact me.