Reputation: 2944
I'm new into AWS Athena, and I'm trying to query multiple S3 buckets containing JSON files. I encountered a number of problems that don't have any answer in documentation (sadly their error log is not informative enough to try to solve it myself):
CREATE EXTERNAL TABLE IF NOT EXISTS test-scema.test_table ( `device`: string, `Capacity(GB)`: string)
Your query has the following error(s):
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.IllegalArgumentException: Error: : expected at the position of 'Capacity(GB):string>' but '(' is found.
My files are located in sub folders in S3 in a following structure:
'location_name/YYYY/MM/DD/appstring/'
and I want to query all the dates of a specific app-string (out of many). is there any 'wildcard' I can use to replace the dates path? Something like this:
LOCATION 's3://location_name/%/%/%/appstring/'
CREATE EXTERNAL TABLE IF NOT EXISTS test_schema.test_table ( field1:string, field2:string ) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' WITH SERDEPROPERTIES ( 'serialization.format' = '1' ) LOCATION 's3://folder/YYYY/MM/DD/appstring' WHERE field2='value'
What would be the outcomes in terms of billing? Cause right now I'm building this CREATE statement only to re-use the data in a SQL query once-again.
Thanks!
Upvotes: 3
Views: 1819
Reputation: 269480
1. JSON field named with parenthesis
There is no need to create a field called Capacity(GB)
. Instead, create the field with a different name:
CREATE EXTERNAL TABLE test_table (
device string,
capacity string
)
ROW FORMAT serde 'org.apache.hive.hcatalog.data.JsonSerDe'
with serdeproperties ( 'paths'='device,Capacity(GB)')
LOCATION 's3://xxx';
If you are using Nested JSON then you can use the Serde's mapping
property (which I saw on issue with Hive Serde dealing nested structs):
CREATE external TABLE test_table (
top string,
inner struct<device:INT,
capacity:INT>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
with serdeproperties
(
"mapping.capacity" = "Capacity(GB)"
)
LOCATION 's3://xxx';
This works nicely with an input of:
{ "top" : "123", "inner": { "Capacity(GB)": 12, "device":2}}
2. Subfolders
You cannot wildcard mid-path (s3://location_name/*/*/*/appstring/
). The closest option is to use partitioned data but that would require a different naming format for your directories.
3. Creating tables
You cannot specify WHERE
statements as part of the CREATE TABLE
statement.
If your aim is to reduce data costs, then use partitioned data to reduce the number of files scanned or store in a column-based format such as Parquet.
For examples, see: Analyzing Data in S3 using Amazon Athena
Upvotes: 2