Reputation: 618
I have a 5 columns, c1, c2, c3, c4, c5
in source table (MySQL). I exported the table and added in S3 like:
s3://data/table_name/YYYYMMDD/table_nameYYYYMMDDHHMMSS.csv
And I exported the data for the 3 months of data.
In later stages, two new columns say c6
and c7
added. Now the exported files in s3://data/table_name/YYYYMMDD/table_nameYYYYMMDDHHMMSS.csv
will have 7 columns
And there is a chance that c4
column is being removed as well. So the exported files will have 6 columns, i.e. c1, c2, c3, c5, c6, c7
How the external table can be modelled to adapt these changes?
Upvotes: 2
Views: 2580
Reputation: 4354
Actually, if you use redshift spectrum / athena you may be OK!
Adding columns - is fine - just run the glue crawler and the new column will be available
deleting columns - is fine - the deleted columns will just appear as nulls
You cannot change a definition of a column.
I definitely recommend - if you can - to convert to parquet and to explore using partitions for your data.
Also, I recommend that you try it out - add the data one change at a time - re-run the glue crawler as necessary - and use athena to make sure all is ok.
Upvotes: 2
Reputation: 3465
You will need multiple external tables.
Since there is data drift, I think you should create separate tables for each column combinations you have and create a view (with union all
) on top them to display data.
So you'll have
table 1 : columns : c1, c2, c3, c4, c5
table 2 : columns : c1, c2, c3, c4, c5, c6, c7
table 3 : columns : c1, c2, c3, c5, c6, c7
view as
select c1, c2, c3, c4, c5, '' c6, '' c7 from table1 union all
select c1, c2, c3, c4, c5, c6, c7 from table2 union all
select c1, c2, c3, '' c4, c5, c6, c7 from table3
If there is another combination comes up you need to create a 4th table and update your view definition.
Upvotes: 1