Ger
Ger

Reputation: 754

How to add connectors to presto on Amazon EMR

I've set up a small EMR cluster with Hive/Presto installed, I want to query files on S3 and import them to Postgres on RDS.

To run queries on S3 and save the results in a table in postgres I've done the following:

  1. Started a 3 node EMR cluster from the AWS console.
  2. Manually SSH into the Master node to create an EXTERNAL table in hive, looking at an S3 bucket.
  3. Manually SSH into each of the 3 nodes and add a new catalog file:

    /etc/presto/conf.dist/catalog/postgres.properties 
    

    with the following contents

    connector.name=postgresql
    connection-url=jdbc:postgresql://ip-to-postgres:5432/database
    connection-user=<user>
    connection-password=<pass>
    

    and edited this file

    /etc/presto/conf.dist/config.properties
    

    adding

    datasources=postgresql,hive
    
  4. Restart presto by running the following manually on all 3 nodes

    sudo restart presto-server
    

This setup seems to work well.

In my application, there are multiple databases created dynamically. It seems that those configuration/catalog changes need to be made for each database and the server needs to be restarted to see the new config changes.

Is there a proper way for my application (using boto or other methods) to update configurations by

  1. Adding a new catalog file in all nodes /etc/presto/conf.dist/catalog/ for each new database
  2. Adding a new entry in all nodes in /etc/presto/conf.dist/config.properties
  3. Gracefully restarting presto across the whole cluster (ideally when it becomes idle, but that's not a major concern.

Upvotes: 4

Views: 4937

Answers (3)

G&#246;rkem M&#252;layim
G&#246;rkem M&#252;layim

Reputation: 1219

I believe you can run a simple bash script to achieve what you want. There is no other way except creating a new cluster with --configurations parameter where you provide the desired configurations. You can run below script from the master node.

#!/bin/sh

# "cluster_nodes.txt" with private IP address of each node.
aws emr list-instances --cluster-id <cluster-id> --instance-states RUNNING | grep PrivateIpAddress | sed 's/"PrivateIpAddress"://' | sed 's/\"//g' | awk '{gsub(/^[ \t]+|[ \t]+$/,""); print;}' > cluster_nodes.txt

# For each IP connect with ssh and configure.
while IFS='' read -r line || [[ -n "$line" ]]; do
    echo "Connecting $line"
    scp -i <PEM file> postgres.properties hadoop@$line:/tmp;
    ssh -i <PEM file> hadoop@$line "sudo mv /tmp/postgres.properties /etc/presto/conf/catalog;sudo chown presto:presto /etc/presto/conf/catalog/postgres.properties;sudo chmod 644 /etc/presto/conf/catalog/postgres.properties; sudo restart presto-server";
done < cluster_nodes.txt

Upvotes: 2

mostafazh
mostafazh

Reputation: 4207

You can provide the configuration via the management console as follows:

Create Cluster EMR Console

Or you can use the awscli to pass those configurations as follows:

#!/bin/bash

JSON=`cat <<JSON
[
  { "Classification": "presto-connector-postgresql",
    "Properties": {
      "connection-url": "jdbc:postgresql://ip-to-postgres:5432/database",
      "connection-user": "<user>",
      "connection-password": "<password>"
    },
    "Configurations": []
  }
]
JSON`

aws emr create-cluster --configurations "$JSON" # ... reset of params

Upvotes: 0

muruga
muruga

Reputation: 1073

During Provision of your cluster: You can provide the configuration details at the time of provision.

Refer to Presto Connector Configuration on how to add this automatically during the provision of your cluster.

Upvotes: 0

Related Questions