Create table on Athena only from certain S3 files (depending on filename)

Question

I'm trying to create a table on Athena from S3 files.
In my bucket, I have different types of files (Activity, Epoch, BodyComp, etc.) and I'd like this table to contain only "Activity" files assuming their filenames are like :
"Activity__xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx__yyyyyyyyyy.json"
where :
- x is a character or a digit
- y is a digit

I can do that after creating the table with this SELECT statement but the query takes too much time:

SELECT *, regexp_extract("$path", '[^/]+$') AS filename
FROM runs
WHERE regexp_extract("$path", '[^/]+$') like 'Activity__%';

I'd like to do it directly in the CREATE TABLE statement.
I tried this with "input.regex" but it didn't work :

CREATE EXTERNAL TABLE IF NOT EXISTS runs(
  summaryId string,
  distanceInMeters float,
  maxHeartRateInBeatsPerMinute int,
  totalElevationGainInMeters float,
  userAccessToken string
)

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' 
WITH SERDEPROPERTIES ('serialization.format' = '1',
                      "input.regex" = "^Activity\_\_\w{8}-\w{4}-\w{4}-\w{4}-\w{12}\_\_\d{10}\.json") 
LOCATION 's3://com.connector/' 
TBLPROPERTIES ('has_encrypted_data'='false');

I think the problem comes from the fact that "input.regex" is not the correct parameter to get the filenames.

Thank you for your help,
Max

Create table on Athena only from certain S3 files (depending on filename)

Answers (1)

Related Questions