Maxime Settembre
Maxime Settembre

Reputation: 21

Create table on Athena only from certain S3 files (depending on filename)

I'm trying to create a table on Athena from S3 files.
In my bucket, I have different types of files (Activity, Epoch, BodyComp, etc.) and I'd like this table to contain only "Activity" files assuming their filenames are like :
"Activity__xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx__yyyyyyyyyy.json"
where :
- x is a character or a digit
- y is a digit

I can do that after creating the table with this SELECT statement but the query takes too much time:

SELECT *, regexp_extract("$path", '[^/]+$') AS filename
FROM runs
WHERE regexp_extract("$path", '[^/]+$') like 'Activity__%';

I'd like to do it directly in the CREATE TABLE statement.
I tried this with "input.regex" but it didn't work :

CREATE EXTERNAL TABLE IF NOT EXISTS runs(
  summaryId string,
  distanceInMeters float,
  maxHeartRateInBeatsPerMinute int,
  totalElevationGainInMeters float,
  userAccessToken string
)

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' 
WITH SERDEPROPERTIES ('serialization.format' = '1',
                      "input.regex" = "^Activity\_\_\w{8}-\w{4}-\w{4}-\w{4}-\w{12}\_\_\d{10}\.json") 
LOCATION 's3://com.connector/' 
TBLPROPERTIES ('has_encrypted_data'='false');

I think the problem comes from the fact that "input.regex" is not the correct parameter to get the filenames.

Thank you for your help,
Max

Upvotes: 2

Views: 1164

Answers (1)

Prabhakar Reddy
Prabhakar Reddy

Reputation: 5124

There is no direct way of doing this. Either you can rename files starting with _(underscore) so that Athena will ignore them or use CTAS and pass the select query above.

Upvotes: 2

Related Questions