Querying AWS Athena the right way

Question

I get a timeout when querying Common Crawl data (https://commoncrawl.org/overview) with Athena. Even if the query succeeded, it would cost me about $1,000 per query: $5 per TB scanned, with roughly 200 TB (?) of data. That is far too much.

This is what I do:

CREATE DATABASE CommonData20102024;



CREATE EXTERNAL TABLE IF NOT EXISTS CommonData20102024.commoncrawl_warc (
  WARC_Type           STRING,
  WARC_Date           STRING,
  WARC_Record_ID      STRING,
  Content_Length      INT,
  WARC_Concurrent_To  STRING,
  Content_Type        STRING,
  WARC_Block_Digest   STRING,
  WARC_Payload_Digest STRING,
  WARC_IP_Address     STRING,
  WARC_Refers_To      STRING,
  WARC_Target_URI     STRING,
  WARC_Truncated      STRING,
  WARC_Warcinfo_ID    STRING,
  WARC_Filename       STRING,
  WARC_Profile        STRING,
  WARC_Identified_Payload_Type STRING,
  Payload             STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "..."
)
LOCATION 's3://commoncrawl/crawl-data/CC-MAIN-2024-38/';


SELECT WARC_Target_URI
FROM CommonData20102024.commoncrawl_warc
WHERE lower(WARC_Target_URI) LIKE '%.de%';

My question: is this the right way to access that data? I just want to get the URLs with the German TLD (.de).

Answers (1)
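
Scanning the raw WARC files is what makes the query slow and expensive. A commonly used alternative is to query Common Crawl's columnar URL index (Parquet), which stores one row per captured URL and is partitioned by crawl. The query below is only a sketch. It assumes the index has already been registered in Athena as ccindex.ccindex, following Common Crawl's published setup (an external table over s3://commoncrawl/cc-index/table/cc-main/warc/, partitioned by crawl and subset). The database and table names are whatever you chose when creating the table, so adjust them as needed.

```sql
-- Assumption: the Common Crawl columnar index is registered in Athena as
-- ccindex.ccindex and its partitions have been loaded (MSCK REPAIR TABLE).
SELECT url
FROM ccindex.ccindex
WHERE crawl = 'CC-MAIN-2024-38'   -- partition filter: only this crawl is read
  AND subset = 'warc'             -- partition filter: successfully fetched pages
  AND url_host_tld = 'de';        -- exact match on the host's top-level domain
```

Because the index is columnar and partitioned, Athena should only need to read the referenced columns within the selected partition, which is far less data than the WARC files themselves. Matching on url_host_tld is also more precise than LIKE '%.de%', which would match hosts such as example.dev or any URL that merely contains ".de" in its path. Running Athena in us-east-1, the region where the Common Crawl bucket lives, is the usual recommendation.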