Reputation: 7298
I have started testing out AWS Athena, and it so far looks good. One problem I am having is about the updating of data in a table.
Here is the scenario: In order to update the data for a given date in the table, I am basically emptying out the S3 bucket that contains the CSV files, and uploading the new files to become the updated data source. However, the period of time during which the bucket is empty (i.e. when the old source is deleted and new source is being uploaded) actually is a bottleneck, because during this interval, anyone querying the table will get no result.
Is there a way around this?
Thanks.
Upvotes: 0
Views: 10805
Reputation: 230
What most people probably want is `MSCK REPAIR TABLE <table_name>`.
This updates the metadata when you have added more files to the location, but it is only available if your table has partitions.
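As a sketch, a date-partitioned layout could look like this (the table name, partition key, and bucket here are hypothetical, not from the question):

```sql
-- Hypothetical table partitioned by dt; each day's files go under
-- s3://my-data-bucket/dt=2019-01-01/, s3://my-data-bucket/dt=2019-01-02/, ...
CREATE EXTERNAL TABLE `my_table`(
  `volume` int,
  `product` string,
  `price` double)
PARTITIONED BY (`dt` string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
LOCATION
  's3://my-data-bucket/';

-- After uploading files for a new day, pick up the new partition:
MSCK REPAIR TABLE `my_table`;
```

Until `MSCK REPAIR TABLE` (or `ALTER TABLE ... ADD PARTITION`) runs, queries simply won't see the new partition, so there is no window where old data disappears.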
You might also want to do this with a Glue Crawler, which can be scheduled to refresh the table with new data.
Upvotes: 0
Reputation: 2668
Athena is a web service that allows you to query data that resides on AWS S3. In order to run queries, Athena needs to know the table schema and where to look for the data on S3. All this information is stored in the AWS Glue Data Catalog. This essentially means that each time you get new data, you simply need to upload a new CSV file to S3.
Let's assume that you get new data every day at midnight and you store it in an S3 bucket:
my-data-bucket
├── data-file-2019-01-01.csv
├── data-file-2019-01-02.csv
└── data-file-2019-01-03.csv
and each of these files looks like:
| date | volume | product | price |
|------------|---------|---------|-------|
| 2019-01-01 | 100 | apple | 10 |
| 2019-01-01 | 200 | orange | 50 |
| 2019-01-01 | 50 | cherry | 100 |
Then, after you have uploaded them to AWS S3, you can use the following DDL statement to define the table:
CREATE EXTERNAL TABLE `my_table`(
  `date` timestamp,
  `volume` int,
  `product` string,
  `price` double)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
LOCATION
  's3://my-data-bucket/'
-- Additional table properties
Now, when you get a new file data-file-2019-01-04.csv and upload it to the same location as the other files, Athena will be able to query the new data as well.
my-data-bucket
├── data-file-2019-01-01.csv
├── data-file-2019-01-02.csv
├── data-file-2019-01-03.csv
└── data-file-2019-01-04.csv
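For instance, given the table defined above, the new rows show up without any further action:

```sql
-- Now also counts the rows from data-file-2019-01-04.csv,
-- since Athena scans everything under the table's LOCATION
SELECT COUNT(*) AS row_count
FROM my_table;
```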
If your scenario is one where you need to update existing data in the S3 bucket, then you can try combining views, tables, and keeping different versions of the data.
Let's say you have `table_v1`, which queries data in the s3://my-data-bucket/v1/ location. You create a view for `table_v1`, which can be seen as a wrapper of sorts:
CREATE VIEW `my_table_view` AS
SELECT *
FROM `table_v1`
Now your users can use `my_table_view` to query the data in s3://my-data-bucket/v1/ instead of `table_v1`. When you want to update the data, you simply upload it to s3://my-data-bucket/v2/ and define a table `table_v2`. Next, you need to update your `my_table_view` view, since all queries are run against it:
CREATE OR REPLACE VIEW `my_table_view` AS
SELECT *
FROM `table_v2`
After this is done, you can drop `table_v1` and delete the files from s3://my-data-bucket/v1/. Provided that the data schema hasn't changed, all queries that ran against the `my_table_view` view while it was based on `table_v1` should still be valid and succeed after `my_table_view` is replaced.
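Putting the whole swap together as a sketch (table and bucket names as above; this is the sequence of statements, not output from a real session):

```sql
-- 1. Define a table over the new version of the data
CREATE EXTERNAL TABLE `table_v2`(
  `date` timestamp,
  `volume` int,
  `product` string,
  `price` double)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
LOCATION
  's3://my-data-bucket/v2/';

-- 2. Repoint the view that users query
CREATE OR REPLACE VIEW `my_table_view` AS
SELECT *
FROM `table_v2`;

-- 3. Clean up the old version
DROP TABLE `table_v1`;
-- (then delete the files under s3://my-data-bucket/v1/)
```

Users querying `my_table_view` never see an empty result set, because the old table and its files stay in place until after the view has been repointed.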
I don't know what the downtime of replacing a view would be, but I'd expect it to be less than a second, which is definitely less than the time it takes to upload new files.
Upvotes: 1