Reputation: 1980
I have a Parquet file in S3 over which I would like to create a Glue table. I'm using the boto3 Python API to do so. Here is the code:
import boto3
c = boto3.client('glue')
c.create_table(
    DatabaseName='staging',
    TableInput={
        'Name': 'test_table',
        'StorageDescriptor': {
            'Location': 's3://staging/tables/test_table/version=2020-03-26',
            'OutputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat',
            'InputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'},
        'PartitionKeys': [{'Name': 'country'}, {'Name': 'city'}],
        'TableType': 'EXTERNAL_TABLE'})
If we take a look at the Location field, we can see the URL to the data. Inside the folder version=2020-03-26 there are folders named 'country=country name', and inside each of those there are folders named 'city=city name' that contain the Parquet files.
The execution of this code returns:
{'ResponseMetadata': {'HTTPHeaders': {'connection': 'keep-alive',
'content-length': '2',
'content-type': 'application/x-amz-json-1.1',
'date': 'Mon, 06 Apr 2020 08:46:58 GMT',
'x-amzn-requestid': 'ca5e4af0-a2ec-4af0-a2ec-18c308132e21'},
'HTTPStatusCode': 200,
'RequestId': 'ca5e4af0-a10d-a2ec-a13d-453dfsdfsds',
'RetryAttempts': 0}}
I can see the table in Glue, but when I run the following query in Athena I get an error:
describe formatted test_table
Your query has the following error(s):
FAILED: SemanticException Unable to fetch table test_table. Unable to get table: java.lang.NullPointerException
What am I doing wrong? Also, I just checked in Glue: even though the table entry was created, no information is returned in the console.
Upvotes: 0
Views: 3956
Reputation: 12939
Working directly with the Glue API using boto3 is sometimes too verbose, and a missing parameter or two can cause a hard-to-debug error. Before fighting the API details, I recommend taking a look at AWS Wrangler.
With a couple of commands, you can read the data into a Pandas dataframe and then create a table with your preferred structure:
import awswrangler as wr

df = wr.pandas.read_parquet(
    path='s3://staging/tables/test_table/version=2020-03-26',
    columns=['country', 'city', ...],
    filters=[("c5", "=", 0)])

# Typical Pandas, Numpy or Pyarrow transformation HERE!

wr.pandas.to_parquet(  # Storing the data and metadata to the Data Lake
    dataframe=df,
    database='my_database',
    path='s3://production/tables/test_table/version=2020-03-26',
    partition_cols=['country', 'city'],
)
If you are using PySpark and want to register the data frame in Glue:
import awswrangler as wr

dataframe.write \
    .mode("overwrite") \
    .format("parquet") \
    .partitionBy(["country", "city"]) \
    .save(compression="gzip", path="s3://production/tables/test_table/version=2020-03-26")

sess = wr.Session(spark_session=spark)
sess.spark.create_glue_table(
    dataframe=dataframe,
    file_format="parquet",
    partition_by=["country", "city"],
    path="s3://production/tables/test_table/version=2020-03-26",
    compression="gzip",
    database="my_database")
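For completeness, if you would rather stay with the raw Glue API: the Athena NullPointerException is typically caused by fields the original create_table call omits. The StorageDescriptor has no SerdeInfo (so Athena doesn't know which SerDe to use) and no Columns, and the PartitionKeys entries have no Type. A minimal sketch of a more complete TableInput follows; the column name and types are assumptions for illustration, so adjust them to your schema:

```python
# Sketch of a TableInput that includes the fields Athena needs.
# Column names/types here are illustrative assumptions.
table_input = {
    'Name': 'test_table',
    'TableType': 'EXTERNAL_TABLE',
    'PartitionKeys': [
        {'Name': 'country', 'Type': 'string'},  # partition keys need a Type
        {'Name': 'city', 'Type': 'string'},
    ],
    'StorageDescriptor': {
        'Location': 's3://staging/tables/test_table/version=2020-03-26',
        'InputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat',
        'OutputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat',
        'SerdeInfo': {  # missing in the original call
            'SerializationLibrary': 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe',
        },
        'Columns': [  # non-partition columns of the Parquet files (hypothetical)
            {'Name': 'c5', 'Type': 'int'},
        ],
    },
}

# import boto3
# boto3.client('glue').create_table(DatabaseName='staging', TableInput=table_input)
```

Note that after creating the table you still need to register the partitions (e.g. with MSCK REPAIR TABLE in Athena, or batch_create_partition in Glue) before queries return data.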
Upvotes: 1