niskeycaustic

Reputation: 11

How to *safely* install a private Python package from GitHub in an AWS EMR bootstrap script

I want to install a Python 3 package from a private GitHub repository onto an AWS EMR Spark cluster.

I know how to do this the dirty way by hardcoding credentials, but what is the recommended best practice for doing this safely? I don't want to store credentials in a bootstrap script.

Thanks in advance.

Upvotes: 0

Views: 846

Answers (2)

niskeycaustic

Reputation: 11

Thanks to Maurice I've successfully implemented a safe process, following his option #2.

  1. Create an access token with read access on GitHub.

  2. Store this token in AWS Secrets Manager. In my case I named the secret "github-read-access".

  3. Grant access to this secret to whoever is going to query it: the user or, in the case of an EMR bootstrap script, the EMR roles.

  4. Using the AWS CLI, I store the token in an environment variable and install the package with the following commands:

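    # The grep/cut chain expects SecretString to hold a single key/value pair as escaped JSON
    export GITHUB_TOKEN=$(aws secretsmanager get-secret-value --secret-id github-read-access | grep SecretString | cut -d ":" -f 3 | cut -d '"' -f 2 | cut -d '\' -f 1)
    # Install the private package, passing the token as the HTTPS username
    sudo pip3 install "git+https://${GITHUB_TOKEN}@github.com/<USER_NAME>/<REPO_NAME>.git"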
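
The grep/cut parsing above depends on the exact layout of the CLI's JSON output. A less fragile variant might look like the sketch below, which uses the AWS CLI's own `--query` and `--output` options instead. It assumes the token is stored as a plaintext secret rather than a key/value pair (a key/value secret would still need a JSON parser such as jq), and the function names `fetch_secret` and `install_private_package` are made up for illustration.

```shell
# Sketch only: assumes the secret holds the bare token as plaintext.

# Print the SecretString of the secret whose id is passed as $1.
fetch_secret() {
    aws secretsmanager get-secret-value \
        --secret-id "$1" \
        --query SecretString \
        --output text
}

# Install a private GitHub package with the fetched token.
# $1 = secret id, $2 = "<USER_NAME>/<REPO_NAME>"
install_private_package() {
    local token
    # Stop early if the secret could not be read or is empty.
    token=$(fetch_secret "$1") || return 1
    [ -n "$token" ] || return 1
    sudo pip3 install "git+https://${token}@github.com/$2.git"
}
```

In a bootstrap script this would be called as `install_private_package github-read-access <USER_NAME>/<REPO_NAME>`. Keeping the token in a local variable rather than exporting it means it is not left behind in the environment of later bootstrap steps.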
    

Upvotes: 1

Maurice

Reputation: 13107

Caveat: I haven't worked with custom EMR bootstrapping scripts, but I assume they're not too different from regular user data scripts.

There are some options:

  1. Systems Manager Parameter Store: This is essentially something like the Windows registry in AWS, a regional key-value store. You can store your credentials here under a name such as /my/git/credentials (hierarchical names need the leading slash) and even encrypt them using the Key Management Service. In your bootstrapping script you can then request the credentials using the AWS CLI and use them to connect to the private git repository. This requires the instance role of the cluster to have permission to access that parameter (and the KMS key, if you've encrypted the value).
  2. Secrets Manager: The general idea is similar to the SSM Parameter Store. Secrets Manager also allows you to store your credentials in a secure way; in this case encryption is mandatory. It even offers rotation hooks to periodically renew the credentials, should you require that. You can use the same technique I described in option 1 in the bootstrapping script. The requirements in terms of permissions are similar: you have to add Secrets Manager permissions, plus KMS permissions if the secret is encrypted with a customer managed key. In this case you'd have to parse the JSON response from Secrets Manager, though.
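
As a rough sketch of option 1, the bootstrap script could read the parameter with `aws ssm get-parameter`, and the instance role would need a policy along the lines of the one printed below. The function names `fetch_parameter` and `print_read_policy` are hypothetical, the parameter name /my/git/credentials is the example from above written with a leading slash, and the ARNs are left for the caller to supply.

```shell
# Sketch only: helper names are made up for illustration.

# Print the decrypted value of the SSM parameter named in $1.
fetch_parameter() {
    aws ssm get-parameter \
        --name "$1" \
        --with-decryption \
        --query Parameter.Value \
        --output text
}

# Print a minimal IAM policy for the cluster's instance role.
# $1 = parameter ARN, $2 = KMS key ARN (only needed for encrypted values).
print_read_policy() {
    cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    { "Effect": "Allow", "Action": "ssm:GetParameter", "Resource": "$1" },
    { "Effect": "Allow", "Action": "kms:Decrypt", "Resource": "$2" }
  ]
}
EOF
}
```

A bootstrap script could then run `GITHUB_TOKEN=$(fetch_parameter /my/git/credentials)` and pass the value to pip. Because `--query Parameter.Value --output text` returns the bare value, no JSON parsing is needed here.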

I'd personally start with option 1, as it will be cheaper. If you have specific audit or regulatory requirements, I'd look at option 2, which is slightly more complex.

Upvotes: 0
