fass33443423
fass33443423

Reputation: 117

Querying athena aws the right way

i get a time out queriing https://commoncrawl.org/overview data with athena ... and if it succeed it will cost me 1000$ each query ... 5$ for each TB with 200 TB (?) ... actually too much

This is, what I do :

CREATE DATABASE CommonData20102024;



CREATE EXTERNAL TABLE IF NOT EXISTS CommonData20102024.commoncrawl_warc (
  WARC_Type           STRING,
  WARC_Date           STRING,
  WARC_Record_ID      STRING,
  Content_Length      INT,
  WARC_Concurrent_To  STRING,
  Content_Type        STRING,
  WARC_Block_Digest   STRING,
  WARC_Payload_Digest STRING,
  WARC_IP_Address     STRING,
  WARC_Refers_To      STRING,
  WARC_Target_URI     STRING,
  WARC_Truncated      STRING,
  WARC_Warcinfo_ID    STRING,
  WARC_Filename       STRING,
  WARC_Profile        STRING,
  WARC_Identified_Payload_Type STRING,
  Payload             STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "..."
)
LOCATION 's3://commoncrawl/crawl-data/CC-MAIN-2024-38/';


SELECT WARC_Target_URI
FROM ccrawl_db.commoncrawl_warc
WHERE lower(WARC_Target_URI) LIKE '%.de%'

My question: is it the right way accessing that data? I just want to get the urls with german tld

Upvotes: 0

Views: 80

Answers (1)

fass33443423
fass33443423

Reputation: 117

Following the tutorial, helped me to find the answer:

https://commoncrawl.org/blog/index-to-warc-files-and-urls-in-columnar-format

The following SQL-Statements I used:

CREATE DATABASE ccindex

CREATE EXTERNAL TABLE IF NOT EXISTS ccindex (
  url_surtkey                   STRING,
  url                           STRING,
  url_host_name                 STRING,
  url_host_tld                  STRING,
  url_host_2nd_last_part        STRING,
  url_host_3rd_last_part        STRING,
  url_host_4th_last_part        STRING,
  url_host_5th_last_part        STRING,
  url_host_registry_suffix      STRING,
  url_host_registered_domain    STRING,
  url_host_private_suffix       STRING,
  url_host_private_domain       STRING,
  url_protocol                  STRING,
  url_port                      INT,
  url_path                      STRING,
  url_query                     STRING,
  fetch_time                    TIMESTAMP,
  fetch_status                  SMALLINT,
  content_digest                STRING,
  content_mime_type             STRING,
  content_mime_detected         STRING,
  content_charset               STRING,
  content_languages             STRING,
  warc_filename                 STRING,
  warc_record_offset            INT,
  warc_record_length            INT,
  warc_segment                  STRING)
PARTITIONED BY (
  crawl                         STRING,
  subset                        STRING)
STORED AS parquet
LOCATION 's3://commoncrawl/cc-index/table/cc-main/warc/';

MSCK REPAIR TABLE ccindex

select url
    from ccindex
   where crawl = 'CC-MAIN-2024-38'
     and subset = 'warc'
     and url_host_tld = 'de';

Upvotes: 1

Related Questions