Eric Bellet
Eric Bellet

Reputation: 2045

How to execute a query with Amazon Athena without exhausted resources?

I'm trying to do this query to obtain some data. The size of the files that I have on s3 in s3://my_datalake/my_table/year=2018/month=9/day=7/ is 1.1 TB and I have 10014 snappy.parquet Objects.

SELECT  array_join(array_agg(distinct endpoint),',') as endpoints_all, count(endpoint) as count_endpoints 
    FROM my_datalake.my_table  
    WHERE year=2018 and month=09 and day=07 
    and ts between timestamp '2018-09-07 00:00:00' and timestamp '2018-09-07 23:59:59'
    and status = '2'
    GROUP BY domain, size, device_id, ip

But I got that error:

Query exhausted resources at this scale factor

(Run time: 6 minutes 41 seconds, Data scanned: 153.87GB)

I have partitions YEAR, MONTH, DAY and HOUR. How can I do this query? Can I do it with Amazon Athena or I need to use another tool?

The schema of my table is:

   CREATE EXTERNAL TABLE `ssp_request_prueba`(
  `version` string, 
  `adunit` string, 
  `adunit_original` string, 
  `brand` string, 
  `country` string, 
  `device_connection_type` string, 
  `device_density` string, 
  `device_height` string, 
  `device_id` string, 
  `device_type` string, 
  `device_width` string, 
  `domain` string, 
  `endpoint` string, 
  `endpoint_version` string, 
  `external_dfp_id` string, 
  `id_req` string, 
  `ip` string, 
  `lang` string, 
  `lat` string, 
  `lon` string, 
  `model` string, 
  `ncc` string, 
  `noc` string, 
  `non` string, 
  `os` string, 
  `osv` string, 
  `scc` string, 
  `sim_operator_code` string, 
  `size` string, 
  `soc` string, 
  `son` string, 
  `source` string, 
  `ts` timestamp, 
  `user_agent` string, 
  `status` string, 
  `delivery_network` string, 
  `delivery_time` string, 
  `delivery_status` string, 
  `delivery_network_key` string, 
  `delivery_price` string, 
  `device_id_original` string, 
  `tracking_limited` string, 
  `from_cache` string, 
  `request_price` string)
PARTITIONED BY ( 
  `year` int, 
  `month` int, 
  `day` int, 
  `hour` int)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://my_datalake/my_table'
TBLPROPERTIES (
  'has_encrypted_data'='false', 
  'transient_lastDdlTime'='1538747353')

Upvotes: 2

Views: 3146

Answers (1)

jbgorski
jbgorski

Reputation: 1939

The problem is probably related to array_join and array_agg functions. I suppose that in this mentioned case the memory limit of node in Athena service has been exceeded. Probably Athena is not able to manage such amount of data in combination with these functions.

Upvotes: 2

Related Questions