Reputation: 79
I want to create a table in AWS Athena from multiple CSV files stored in S3.
The CSVs have a header row with column names. My problem is that the columns are in a different order in each CSV, and I want to match them up by name.
When I use the normal CREATE TABLE in Athena, columns are mapped to the files by position, so I just get the first two columns of each file:
CREATE EXTERNAL TABLE `test`(
  `id` string,
  `name` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'escapeChar'='\\',
  'quoteChar'='\"',
  'separatorChar'=',')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://...'
TBLPROPERTIES (
  'has_encrypted_data'='false')
Here's an example:
CSV 1:
+----+-------+-------+---------+
| id | name | price | comment |
+----+-------+-------+---------+
| 1 | shirt | 123 | abc |
| 2 | shoes | 222 | ddd |
+----+-------+-------+---------+
CSV 2:
+----+------+-------+-------+---------+
| id | size | price | color | name |
+----+------+-------+-------+---------+
| 5 | L | 100 | red | shirt |
| 6 | S | 55 | white | t-shirt |
+----+------+-------+-------+---------+
The table I want:
+----+---------+
| id | name |
+----+---------+
| 1 | shirt |
| 2 | shoes |
| 5 | shirt |
| 6 | t-shirt |
+----+---------+
The table I get:
+----+-------+
| id | name |
+----+-------+
| 1 | shirt |
| 2 | shoes |
| 5 | L |
| 6 | S |
+----+-------+
Thanks
Upvotes: 4
Views: 5942
Reputation: 6684
IMO, it's kinda nuts that the Glue Crawler doesn't have a setting to pick up column names and use them to define the schema of a table. We ran into this same issue (schema changes within a single S3 folder), and here is how we solved it.
Note: the solution below works only if you can map a schema (header order) to particular S3 paths.
We have four files: a.csv and b.csv share the same schema, whereas c.csv and d.csv each have a different schema.
$ cat a.csv
a,b
1,2
3,4
$ cat b.csv
a,b
5,6
3,4
$ cat c.csv
a,b,c
1,2,3
4,5,6
$ cat d.csv
a,c,d,x
6,7,8,9
1,2,3,4
These are saved in S3:
$ aws s3 ls s3://example-s3-bucket/
2019-01-04 09:47:42 12 a.csv
2019-01-04 09:49:49 12 b.csv
2019-01-04 09:49:53 18 c.csv
2019-01-04 09:49:56 24 d.csv
Create one table per schema, passing the same S3 location in to each. Note that I'm leaving the SerDe and delimiter definitions out of the DDL below for brevity.
create external table athena_testing_ab (
a int,
b int
)
LOCATION 's3://example-s3-bucket/'
;
create external table athena_testing_c (
a int,
b int,
c int
)
LOCATION 's3://example-s3-bucket/'
;
create external table athena_testing_d (
a int,
c int,
d int,
x int
)
LOCATION 's3://example-s3-bucket/'
;
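For completeness, here's a sketch of what the first DDL might look like with the SerDe spelled out. The settings mirror the question's DDL, and skip.header.line.count (which Athena supports) skips each file's header row:
-- same table as above, but with the CSV SerDe and header skipping included
create external table athena_testing_ab (
  a int,
  b int
)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties (
  'escapeChar'='\\',
  'quoteChar'='\"',
  'separatorChar'=','
)
location 's3://example-s3-bucket/'
tblproperties ('skip.header.line.count'='1')
;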
We now query these three tables and UNION them all together, filtering on the appropriate S3 paths for each table.
You will probably want to use regex or substring parsing to filter on "$path" more elegantly, especially if you have hundreds or thousands of files in your bucket.
select
a,
b,
null as c,
null as d,
null as x
from
athena_testing_ab
where "$PATH" in ('s3://example-s3-bucket/a.csv', 's3://example-s3-bucket/b.csv')
union all
select
a,
b,
c,
null as d,
null as x
from
athena_testing_c
where "$PATH" in ('s3://example-s3-bucket/c.csv')
union all
select
a,
null as b,
c,
d,
x
from
athena_testing_d
where "$PATH" in ('s3://example-s3-bucket/d.csv')
Upvotes: 5
Reputation: 6385
I would go with two different tables for the different CSVs (you will need to store the CSVs in different folders).
Then, to get the id, name structure from both CSVs, I'd create a VIEW that unions the necessary columns from the two tables.
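For example, assuming the two folders back tables named products_1 and products_2 (hypothetical names), the view could look like this:
-- hypothetical table names; each one points at one folder of CSVs
create view products_by_name as
select id, name from products_1
union all
select id, name from products_2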
Upvotes: 2
Reputation: 946
Place the files in different folders and run a Glue Crawler over them to create the tables.
Upvotes: 0