Reputation: 21
I'm trying to create a table on Athena from S3 files.
In my bucket, I have different types of files (Activity, Epoch, BodyComp, etc.) and I'd like this table to contain only "Activity" files assuming their filenames are like :
"Activity__xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx__yyyyyyyyyy.json"
where :
- x is a character or a digit
- y is a digit
I can do that after creating the table with this SELECT statement but the query takes too much time:
SELECT *, regexp_extract("$path", '[^/]+$') AS filename
FROM runs
WHERE regexp_extract("$path", '[^/]+$') like 'Activity__%';
I'd like to do it directly in the CREATE TABLE statement.
I tried this with "input.regex" but it didn't work :
CREATE EXTERNAL TABLE IF NOT EXISTS runs(
summaryId string,
distanceInMeters float,
maxHeartRateInBeatsPerMinute int,
totalElevationGainInMeters float,
userAccessToken string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ('serialization.format' = '1',
"input.regex" = "^Activity\_\_\w{8}-\w{4}-\w{4}-\w{4}-\w{12}\_\_\d{10}\.json")
LOCATION 's3://com.connector/'
TBLPROPERTIES ('has_encrypted_data'='false');
I think the problem comes from the fact that "input.regex" is not the correct parameter to get the filenames.
Thank you for your help,
Max
Upvotes: 2
Views: 1164
Reputation: 5124
There is no direct way of doing this. Either you can rename files starting with _
(underscore) so that Athena will ignore them or use CTAS
and pass the select query above.
Upvotes: 2