Reputation: 2045
I'm trying to run this query to obtain some data. The files I have on S3 under s3://my_datalake/my_table/year=2018/month=9/day=7/ total 1.1 TB across 10014 snappy.parquet objects.
SELECT array_join(array_agg(distinct endpoint),',') as endpoints_all, count(endpoint) as count_endpoints
FROM my_datalake.my_table
WHERE year=2018 and month=09 and day=07
and ts between timestamp '2018-09-07 00:00:00' and timestamp '2018-09-07 23:59:59'
and status = '2'
GROUP BY domain, size, device_id, ip
But I get this error:
Query exhausted resources at this scale factor
(Run time: 6 minutes 41 seconds, Data scanned: 153.87GB)
I have YEAR, MONTH, DAY and HOUR partitions. How can I run this query? Can I do it with Amazon Athena, or do I need to use another tool?
The schema of my table is:
CREATE EXTERNAL TABLE `ssp_request_prueba`(
`version` string,
`adunit` string,
`adunit_original` string,
`brand` string,
`country` string,
`device_connection_type` string,
`device_density` string,
`device_height` string,
`device_id` string,
`device_type` string,
`device_width` string,
`domain` string,
`endpoint` string,
`endpoint_version` string,
`external_dfp_id` string,
`id_req` string,
`ip` string,
`lang` string,
`lat` string,
`lon` string,
`model` string,
`ncc` string,
`noc` string,
`non` string,
`os` string,
`osv` string,
`scc` string,
`sim_operator_code` string,
`size` string,
`soc` string,
`son` string,
`source` string,
`ts` timestamp,
`user_agent` string,
`status` string,
`delivery_network` string,
`delivery_time` string,
`delivery_status` string,
`delivery_network_key` string,
`delivery_price` string,
`device_id_original` string,
`tracking_limited` string,
`from_cache` string,
`request_price` string)
PARTITIONED BY (
`year` int,
`month` int,
`day` int,
`hour` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://my_datalake/my_table'
TBLPROPERTIES (
'has_encrypted_data'='false',
'transient_lastDdlTime'='1538747353')
Upvotes: 2
Views: 3146
Reputation: 1939
The problem is probably related to the array_join and array_agg functions. My guess is that in this case the memory limit of a single Athena worker node was exceeded: Athena is likely unable to handle this amount of data in combination with these functions.
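One possible workaround (a sketch under assumptions, not a tested fix): deduplicate endpoints in a subquery first, so array_agg no longer has to keep DISTINCT state per group, and use the hour partition you already have so each query scans far less data. The ts BETWEEN filter is dropped here on the assumption that the day/hour partitions already bound the data; keep it if rows can cross partition boundaries.
-- Run this once per hour (hour = 0..23) and combine the 24 result
-- sets outside Athena, e.g. in a follow-up query or a script.
SELECT domain, size, device_id, ip,
       array_join(array_agg(endpoint), ',') AS endpoints_all,
       sum(cnt) AS count_endpoints
FROM (
    -- inner GROUP BY makes endpoint distinct per group, so the
    -- outer array_agg works on already-deduplicated rows
    SELECT domain, size, device_id, ip, endpoint,
           count(endpoint) AS cnt
    FROM my_datalake.my_table
    WHERE year = 2018 AND month = 9 AND day = 7
      AND hour = 0
      AND status = '2'
    GROUP BY domain, size, device_id, ip, endpoint
) AS t
GROUP BY domain, size, device_id, ip
This spreads the aggregation state over 24 much smaller queries instead of one 150+ GB scan, which is usually enough to stay under Athena's per-node memory limits.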
Upvotes: 2