saifuddin778

Reputation: 7298

AWS Athena Table Data Update

I have started testing out AWS Athena, and it so far looks good. One problem I am having is about the updating of data in a table.

Here is the scenario: In order to update the data for a given date in the table, I am basically emptying out the S3 bucket that contains the CSV files, and uploading the new files to become the updated data source. However, the period of time during which the bucket is empty (i.e. when the old source is deleted and new source is being uploaded) actually is a bottleneck, because during this interval, anyone querying the table will get no result.

Is there a way around this?

Thanks.

Upvotes: 0

Views: 10805

Answers (2)

Gomibushi

Reputation: 230

What most people want to do is probably MSCK REPAIR TABLE <table_name>.

This updates the metadata if you have added more files in the table's location, but it is only available if your table is partitioned.
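As a minimal sketch (hypothetical names; this assumes daily files are uploaded under Hive-style dt=YYYY-MM-DD/ prefixes, which is the layout MSCK REPAIR TABLE can discover):

CREATE EXTERNAL TABLE `my_table`(
  `volume` int,
  `product` string,
  `price` double)
PARTITIONED BY (`dt` string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
LOCATION
  's3://my-bucket/data/'

-- after uploading files under e.g. s3://my-bucket/data/dt=2019-01-04/
MSCK REPAIR TABLE my_table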

You might also want to do this with a Glue Crawler, which can be scheduled to refresh the table with new data.

Relevant documentation.

Upvotes: 0

Ilya Kisil

Reputation: 2668

Athena is a web service that allows you to query data that resides on AWS S3. In order to run queries, Athena needs to know the table schema and where to look for the data on S3. All of this information is stored in the AWS Glue Data Catalog. This essentially means that each time you get new data, you simply need to upload a new CSV file to S3.

Let's assume that you get new data every day at midnight and that you store it in an S3 bucket:

my-data-bucket
├── data-file-2019-01-01.csv
├── data-file-2019-01-02.csv
└── data-file-2019-01-03.csv

and each of these files looks like:

| date       | volume  | product | price |
|------------|---------|---------|-------|
| 2019-01-01 | 100     | apple   | 10    |
| 2019-01-01 | 200     | orange  | 50    |
| 2019-01-01 | 50      | cherry  | 100   |

Then, after you have uploaded them to AWS S3, you can use a DDL statement like the following to define the table (the ROW FORMAT clause and the header-skip property tell Athena how to read these CSV files):

CREATE EXTERNAL TABLE `my_table`(
  `date` date, 
  `volume` int, 
  `product` string, 
  `price` double)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
LOCATION
  's3://my-data-bucket/'
TBLPROPERTIES ('skip.header.line.count'='1')

Now when you get a new file, data-file-2019-01-04.csv, and upload it to the same location as the other files, Athena will be able to query the new data as well.

my-data-bucket
├── data-file-2019-01-01.csv
├── data-file-2019-01-02.csv
├── data-file-2019-01-03.csv
└── data-file-2019-01-04.csv
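For example, a query like the following (a hypothetical query against the table defined above) would now return the new day's rows without any further DDL:

SELECT product, volume, price
FROM my_table
WHERE "date" = DATE '2019-01-04'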

Update 2019-09-19

If your scenario is one where you need to update data in the S3 bucket, then you can try combining views, tables, and keeping different versions of the data.

Let's say you have table_v1, which queries data in the s3://my-data-bucket/v1/ location. You create a view for table_v1, which can be seen as a wrapper of sorts:

CREATE VIEW `my_table_view` AS
SELECT *
FROM `table_v1`

Now your users can query my_table_view instead of table_v1 to read the data in s3://my-data-bucket/v1/. When you want to update the data, you can upload it to s3://my-data-bucket/v2/ and define a table table_v2 over that location (a minimal sketch, assuming the same schema and CSV layout as before):
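CREATE EXTERNAL TABLE `table_v2`(
  `date` date, 
  `volume` int, 
  `product` string, 
  `price` double)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
LOCATION
  's3://my-data-bucket/v2/'
TBLPROPERTIES ('skip.header.line.count'='1')

Next, you need to update your my_table_view view, since all queries are run against it: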

CREATE OR REPLACE VIEW `my_table_view` AS
SELECT *
FROM `table_v2`
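At this point a quick sanity check (a hypothetical query, just to confirm that the view now serves the v2 data) could be:

SELECT COUNT(*) AS row_count
FROM my_table_view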

After this is done, you can drop table_v1 and delete the files from s3://my-data-bucket/v1/. Provided that the data schema hasn't changed, all queries that ran against the my_table_view view while it was based on table_v1 should still be valid and succeed after my_table_view has been replaced.
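For completeness, dropping the old table is just:

DROP TABLE IF EXISTS table_v1

Note that this only removes the table metadata; because table_v1 is an external table, the CSV files under s3://my-data-bucket/v1/ still have to be deleted from S3 separately.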

I don't know what the downtime of replacing a view would be, but I'd expect it to be less than a second, which is definitely less than the time it takes to upload new files.

Upvotes: 1
