Athena AWS bad field name and multiple folders with Hive DDL

Question

I'm new into AWS Athena, and I'm trying to query multiple S3 buckets containing JSON files. I encountered a number of problems that don't have any answer in documentation (sadly their error log is not informative enough to try to solve it myself):

How to query a JSON field named with parenthesis? For example I have a field named "Capacity(GB)", and when I'm trying to include in the CREATE EXTERNAL statement I receive an error:

   CREATE EXTERNAL TABLE IF NOT EXISTS test-scema.test_table (
  `device`: string,
  `Capacity(GB)`: string)
Your query has the following error(s):

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.IllegalArgumentException: Error: : expected at the position of 'Capacity(GB):string>' but '(' is found.

My files are located in sub folders in S3 in a following structure:

'location_name/YYYY/MM/DD/appstring/'

and I want to query all the dates of a specific app-string (out of many). is there any 'wildcard' I can use to replace the dates path? Something like this:

LOCATION 's3://location_name/%/%/%/appstring/'

Do I have to load the raw data as-is using CREATE EXTERNAL TABLE, and only then query it, or I can add some WHERE statements build-in? Specifically is someting like this is possible:

CREATE EXTERNAL TABLE IF NOT EXISTS test_schema.test_table (
  field1:string,
  field2:string
  )

ROW FORMAT SERDE  'org.apache.hive.hcatalog.data.JsonSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
) LOCATION 's3://folder/YYYY/MM/DD/appstring'

WHERE field2='value'

What would be the outcomes in terms of billing? Cause right now I'm building this CREATE statement only to re-use the data in a SQL query once-again.

Thanks!

John Rotenstein · Accepted Answer

1. JSON field named with parenthesis

There is no need to create a field called Capacity(GB). Instead, create the field with a different name:

CREATE EXTERNAL TABLE test_table (
    device string,
    capacity string
)
ROW FORMAT  serde 'org.apache.hive.hcatalog.data.JsonSerDe'
with serdeproperties ( 'paths'='device,Capacity(GB)')
LOCATION 's3://xxx';

If you are using Nested JSON then you can use the Serde's mapping property (which I saw on issue with Hive Serde dealing nested structs):

CREATE external TABLE test_table (
   top string,
   inner struct
   )
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
with serdeproperties
(
"mapping.capacity" = "Capacity(GB)"
)
LOCATION 's3://xxx';

This works nicely with an input of:

{ "top" : "123", "inner": { "Capacity(GB)": 12, "device":2}}

2. Subfolders

You cannot wildcard mid-path (s3://location_name/*/*/*/appstring/). The closest option is to use partitioned data but that would require a different naming format for your directories.

3. Creating tables

You cannot specify WHERE statements as part of the CREATE TABLE statement.

If your aim is to reduce data costs, then use partitioned data to reduce the number of files scanned or store in a column-based format such as Parquet.

For examples, see: Analyzing Data in S3 using Amazon Athena

Athena AWS bad field name and multiple folders with Hive DDL

Answers (1)

Related Questions