Reputation: 11
I want to install a Python 3 package from a private GitHub repository onto an AWS EMR Spark cluster.
I know how to do this the dirty way by hardcoding credentials, but what is the recommended best practice for doing this safely? I don't want to store credentials in a bootstrap script.
Thanks in advance.
Upvotes: 0
Views: 846
Reputation: 11
Thanks to Maurice I've successfully implemented a safe process, following his option #2.
Create an access token with read-only access on GitHub.
Store it in AWS Secrets Manager. In my case I named the secret "github-read-access".
Grant access to this secret to the user who will query it or, in the case of an EMR bootstrap script, to the EMR roles.
Using the AWS CLI, I store the token in an environment variable and install the package with the following commands:
export GITHUB_TOKEN=`aws secretsmanager get-secret-value --secret-id github-read-access |grep SecretString|cut -d ":" -f 3|cut -d '"' -f 2 |cut -d '\' -f1`
sudo pip3 install git+https://${GITHUB_TOKEN}@github.com/<USER_NAME>/<REPO_NAME>.git
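The grep/cut chain above depends on the exact layout of the CLI's JSON output. A slightly more robust sketch, assuming the secret is either a plain string or a JSON key/value object and that python3 is available on the node, could look like this. The function names are hypothetical, not part of any AWS tooling.

```shell
# Sketch only: helper names are made up for illustration.

# Print the raw SecretString for a secret ID using the CLI's built-in --query filter.
fetch_secret_string() {
  aws secretsmanager get-secret-value --secret-id "$1" \
    --query SecretString --output text
}

# Extract the token from a SecretString.
# $1: the SecretString; $2: optional JSON key.
# If the string is a JSON object, print the value for the key (or the first
# value when no key is given); otherwise print the string unchanged.
extract_token() {
  printf '%s' "$1" | python3 -c '
import json, sys
raw = sys.stdin.read()
key = sys.argv[1]
try:
    data = json.loads(raw)
except ValueError:
    data = None
if isinstance(data, dict):
    print(data[key] if key else next(iter(data.values())))
else:
    print(raw)
' "${2:-}"
}

# Usage in a bootstrap script (commented out so that sourcing this file has no side effects):
# GITHUB_TOKEN=$(extract_token "$(fetch_secret_string github-read-access)")
# sudo pip3 install "git+https://${GITHUB_TOKEN}@github.com/<USER_NAME>/<REPO_NAME>.git"
```

Passing the secret to python3 on stdin rather than as an argument keeps it out of the process list while it is being parsed.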
Upvotes: 1
Reputation: 13107
Caveat: I haven't worked with custom EMR bootstrapping scripts, but I assume they're not too different from regular user data scripts.
There are some options:
my/git/credentials
and even encrypt them using the Key Management Service (KMS). In your bootstrapping script you can then request the credentials using the AWS CLI and use them to connect to the private git repository. This requires the instance role of the cluster to have permission to access that parameter (and the KMS key, if you've encrypted the value).
I'd personally start with option 1, as it will be cheaper. If you have specific audit or regulatory requirements, I'd look at option 2, which is slightly more complex.
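For the parameter-based option described above, a minimal sketch might look like the following. It assumes the option refers to AWS Systems Manager Parameter Store with a SecureString parameter. The name /my/git/credentials is adapted from the fragment above (with a leading slash added, as hierarchical parameter names need one), and the function names are hypothetical.

```shell
# Sketch only: assumes the token is stored in SSM Parameter Store as a
# SecureString; parameter and function names are illustrative.

# Fetch and decrypt a parameter value. The cluster's instance role needs
# ssm:GetParameter on the parameter (and kms:Decrypt on the key if a
# customer-managed KMS key encrypts it).
fetch_parameter() {
  aws ssm get-parameter --name "$1" --with-decryption \
    --query 'Parameter.Value' --output text
}

# Build the pip-installable URL for a private GitHub repository.
# $1: token, $2: GitHub user or organisation, $3: repository name.
build_repo_url() {
  printf 'git+https://%s@github.com/%s/%s.git\n' "$1" "$2" "$3"
}

# Usage in a bootstrap script (commented out so that sourcing this file has no side effects):
# GIT_TOKEN=$(fetch_parameter /my/git/credentials)
# sudo pip3 install "$(build_repo_url "$GIT_TOKEN" <USER_NAME> <REPO_NAME>)"
```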
Upvotes: 0