Reputation: 2082
I have the AWS CLI and boto3 installed in my Python 2.7 environment. I want to do various operations, such as getting schema information and database details for all the tables present in the AWS Glue console. I tried the following sample scripts:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
glueContext = GlueContext(SparkContext.getOrCreate())
persons = glueContext.create_dynamic_frame.from_catalog(
    database="records",
    table_name="recordsrecords_converted_json")
print "Count: ", persons.count()
persons.printSchema()
I got the error ImportError: No module named awsglue.transforms, which makes sense, as there is no such package in boto3 as far as I can tell from dir(boto3). I found that boto3 offers various service clients, and a Glue client can be created with client = boto3.client('glue'). So, to get the schema information as above, I tried the sample code below:
import sys
import boto3
client=boto3.client('glue')
response = client.get_databases(
    CatalogId='string',
    NextToken='string',
    MaxResults=123
)
print client
But then I get this error:
AccessDeniedException: An error occurred (AccessDeniedException) when calling the GetDatabases operation: Cross account access is not allowed.
I am pretty sure that one of these approaches, or probably both, is correct for what I am trying to do, but something is not falling into place. Any ideas on how to get the schema and database table details from AWS Glue locally with Python 2.7, as I tried above?
Upvotes: 6
Views: 6957
Reputation: 3153
The following code works for me; I am using a locally set up Zeppelin notebook connected to a Glue development endpoint. printSchema reads the schema from the Data Catalog.
Make sure you have enabled SSH tunnelling to the dev endpoint as well.
%pyspark
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import *
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
# Create a Glue context
glueContext = GlueContext(SparkContext.getOrCreate())
# Create a DynamicFrame using the 'persons_json' table
medicare_dynamicframe = glueContext.create_dynamic_frame.from_catalog(database="payments", table_name="medicaremedicare_hospital_provider_csv")
# Print out information about this data
print "Count: ", medicare_dynamicframe.count()
medicare_dynamicframe.printSchema()
You may also need to change the Spark interpreter settings in Zeppelin: tick the "Connect to existing process" option at the top, and set the host to localhost and the port to 9007.
For the second part, you need to run aws configure and then create the Glue client after installing boto3. After that, check your proxy settings if you are behind a firewall or on a company network.
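Here is a minimal sketch of that client-side approach (assuming the credentials from aws configure belong to the account that owns the catalog, and reusing the records database name from the question). It omits CatalogId so the calls target your own Data Catalog; passing the literal 'string' placeholder is what triggers the cross-account error:
import boto3
# Create the Glue client; region and credentials are picked up from `aws configure`
client = boto3.client('glue')
# List every database in this account's Data Catalog
databases = client.get_databases()
for db in databases['DatabaseList']:
    print(db['Name'])
# List the tables and their column schemas for one database
# ('records' is the database name taken from the question)
tables = client.get_tables(DatabaseName='records')
for table in tables['TableList']:
    print(table['Name'])
    for column in table['StorageDescriptor']['Columns']:
        print('  %s: %s' % (column['Name'], column['Type']))
A single table's schema can be fetched the same way with client.get_table(DatabaseName='records', Name='your_table_name').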
To be clear, the boto3 client is useful for all the client-side AWS API calls, while for the server-side (Spark/ETL) work the Zeppelin approach is the best.
Hope this helps.
Upvotes: 3