Reputation: 482
I have a scenario, where I have two tables - Table1 and Table2 getting created using crawlers with source as parquet inside my s3 prefix - s3://bucket_name/my_tables/table1/data.parquet, s3://bucket_name/my_tables/table2/data.parquet.
These table have following columns - Table1 - id, type, name, address Table2 - id, type, officenumber, designation
Consider type of all columns as string here.
In athena - Now I want to create a single table called - MyTables with all the columns in the two tables above combined. Basically, I ran a query to create this table manually specifying the source as - s3://bucket_name/my_tables
The table gets created but it is empty. How do I achieve this without duplicating records in S3 and utilizing the same files that were used to create Table1 and Table2. After it is just a table with all columns present in these two tables.
I was able to load partitions individually. First I ran an SQL query to create a table in Athena. And then I added partitions using ALTER PARTITIONS query for all the date partitions.
Is there a way to load the data using single query? Also, if new date partitions gets added, then do I have to add those partitions manually again?
Upvotes: 0
Views: 875
Reputation: 132902
If I understand you correctly you have these two tables:
Table1
======
id | type | name | address
---+------+------+--------
i1 | t1 | n1 | a1
i2 | t2 | n2 | a2
and
Table2
======
id | type | officenumber | designation
---+------+--------------+------------
i3 | t3 | o3 | d3
i4 | t4 | o4 | d4
and you want to get
id | type | name | address | officenumber | designation
---+------+------+---------+--------------+------------
i1 | t1 | n1 | a1 | NULL | NULL
i2 | t2 | n2 | a2 | NULL | NULL
i3 | t3 | NULL | NULL | o3 | d3
i4 | t4 | NULL | NULL | o4 | d4
i.e. it's not a join you're after but a kind of union of the rows and schemas.
The more generic solution to this is to create a view with a union query:
CREATE OR REPLACE VIEW union_view AS
SELECT id, type, name, NULL AS officenumber, NULL AS designation FROM Table1
UNION ALL
SELECT id, type, NULL AS name, NULL AS address, officenumber, designation FROM Table2
Another way you can do this is by creating a new table with a LOCATION
pointing to the shared parent of the two tables' LOCATION
s (assuming there is no other data under that prefix). This will work because Athena maps table columns to the columns in Parquet files by name, and if a file is missing a column Athena will fill with NULL
(this is not true for all data formats, but it is for Parquet and JSON):
CREATE EXTERNAL TABLE union_table (
id STRING,
type STRING,
name STRING,
address STRING,
officenumber STRING,
designation STRING
)
STORED AS PARQUET
LOCATION 's3://bucket_name/my_tables/'
This sounds like what you tried to do when you got an empty table. The reason that happened was probably because you created a partitioned table and didn't add partitions.
One thing to ask yourself is: do you need a partitioned table? If not, just go with the SQL above.
If you need partitions I recommend using Partition Projection and to not rely on crawlers. Crawlers are ok for discovering a schema during development, but they are slow and expensive compared to Partition Projection, and a little bit dangerous to use in production. There is a command that comes up often in the documentation: MSCK REPAIR TABLE
, but this won't work for you because your data is not using Hive style partition naming (but no great loss, it's an inefficient way of managing partitons and you should really use Partition Projection if you need a partitioned table).
Upvotes: 1