Reputation: 79
I want to create a table in AWS Athena from multiple CSV files stored in S3.
The CSVs have a header row with column names. My problem is that the columns are in a different order in each CSV, and I want to match them up by name.
When I use the normal CREATE TABLE in Athena, columns are mapped to the files by position, so I just get the first two columns of each file:
CREATE EXTERNAL TABLE `test`(
  `id` string,
  `name` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'escapeChar'='\\',
  'quoteChar'='\"',
  'separatorChar'=',')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://...'
TBLPROPERTIES (
  'has_encrypted_data'='false')
Here's an example:
CSV 1:
+----+-------+-------+---------+
| id | name | price | comment |
+----+-------+-------+---------+
| 1 | shirt | 123 | abc |
| 2 | shoes | 222 | ddd |
+----+-------+-------+---------+
CSV 2:
+----+------+-------+-------+---------+
| id | size | price | color | name |
+----+------+-------+-------+---------+
| 5 | L | 100 | red | shirt |
| 6 | S | 55 | white | t-shirt |
+----+------+-------+-------+---------+
The table I want:
+----+---------+
| id | name |
+----+---------+
| 1 | shirt |
| 2 | shoes |
| 5 | shirt |
| 6 | t-shirt |
+----+---------+
The table I get:
+----+-------+
| id | name |
+----+-------+
| 1 | shirt |
| 2 | shoes |
| 5 | L |
| 6 | S |
+----+-------+
Thanks
Upvotes: 4
Views: 5942
Reputation: 6684
IMO, it's kinda nuts that the Glue Crawler doesn't have a setting to pick up column names and use them to define the schema of a table. We ran into this same issue (schema changes within a single S3 folder), and here is how we solved it.
Note: the solution below works only if you can map a schema (header order) to particular S3 paths.
We have four files: a.csv and b.csv share the same schema, whereas c.csv and d.csv each have a different schema.
$ cat a.csv
a,b
1,2
3,4
$ cat b.csv
a,b
5,6
3,4
$ cat c.csv
a,b,c
1,2,3
4,5,6
$ cat d.csv
a,c,d,x
6,7,8,9
1,2,3,4
These are saved in S3:
$ aws s3 ls s3://example-s3-bucket/
2019-01-04 09:47:42 12 a.csv
2019-01-04 09:49:49 12 b.csv
2019-01-04 09:49:53 18 c.csv
2019-01-04 09:49:56 24 d.csv
Create one table per schema, passing the same S3 location in to each. Note that I'm leaving the SerDe and delimiter definitions out of the DDL below for brevity.
create external table athena_testing_ab (
a int,
b int
)
LOCATION 's3://example-s3-bucket/'
;
create external table athena_testing_c (
a int,
b int,
c int
)
LOCATION 's3://example-s3-bucket/'
;
create external table athena_testing_d (
a int,
c int,
d int,
x int
)
LOCATION 's3://example-s3-bucket/'
;
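For completeness, here's a sketch of what the first DDL might look like with the SerDe spelled out. The settings mirror the question's DDL, and skip.header.line.count (which Athena supports) skips each file's header row:
-- same table as above, but with the CSV SerDe and header skipping included
create external table athena_testing_ab (
  a int,
  b int
)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties (
  'escapeChar'='\\',
  'quoteChar'='\"',
  'separatorChar'=','
)
location 's3://example-s3-bucket/'
tblproperties ('skip.header.line.count'='1')
;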
We now query these three tables and UNION them all together, filtering on the appropriate S3 paths for each table.
You will probably want to use regex or substring parsing to filter on "$path" more elegantly, especially if you have hundreds or thousands of files in your bucket.
select
a,
b,
null as c,
null as d,
null as x
from
athena_testing_ab
where "$PATH" in ('s3://example-s3-bucket/a.csv', 's3://example-s3-bucket/b.csv')
union all
select
a,
b,
c,
null as d,
null as x
from
athena_testing_c
where "$PATH" in ('s3://example-s3-bucket/c.csv')
union all
select
a,
null as b,
c,
d,
x
from
athena_testing_d
where "$PATH" in ('s3://example-s3-bucket/d.csv')
Upvotes: 5
Reputation: 6385
I would go with two different tables for the different CSVs (you will need to store the CSVs in different folders).
Then, to get the id, name structure from both CSVs, I'd create a VIEW that unions the necessary columns from the two tables.
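For example, assuming the two folders back tables named products_1 and products_2 (hypothetical names), the view could look like this:
-- hypothetical table names; each one points at one folder of CSVs
create view products_by_name as
select id, name from products_1
union all
select id, name from products_2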
Upvotes: 2
Reputation: 946
Place the files in different folders and run a Glue Crawler over them to create the tables.
Upvotes: 0