Reputation: 117
i get a time out queriing https://commoncrawl.org/overview data with athena ... and if it succeed it will cost me 1000$ each query ... 5$ for each TB with 200 TB (?) ... actually too much
This is, what I do :
CREATE DATABASE CommonData20102024;
CREATE EXTERNAL TABLE IF NOT EXISTS CommonData20102024.commoncrawl_warc (
WARC_Type STRING,
WARC_Date STRING,
WARC_Record_ID STRING,
Content_Length INT,
WARC_Concurrent_To STRING,
Content_Type STRING,
WARC_Block_Digest STRING,
WARC_Payload_Digest STRING,
WARC_IP_Address STRING,
WARC_Refers_To STRING,
WARC_Target_URI STRING,
WARC_Truncated STRING,
WARC_Warcinfo_ID STRING,
WARC_Filename STRING,
WARC_Profile STRING,
WARC_Identified_Payload_Type STRING,
Payload STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "..."
)
LOCATION 's3://commoncrawl/crawl-data/CC-MAIN-2024-38/';
SELECT WARC_Target_URI
FROM ccrawl_db.commoncrawl_warc
WHERE lower(WARC_Target_URI) LIKE '%.de%'
My question: is it the right way accessing that data? I just want to get the urls with german tld
Upvotes: 0
Views: 80
Reputation: 117
Following the tutorial, helped me to find the answer:
https://commoncrawl.org/blog/index-to-warc-files-and-urls-in-columnar-format
The following SQL-Statements I used:
CREATE DATABASE ccindex
CREATE EXTERNAL TABLE IF NOT EXISTS ccindex (
url_surtkey STRING,
url STRING,
url_host_name STRING,
url_host_tld STRING,
url_host_2nd_last_part STRING,
url_host_3rd_last_part STRING,
url_host_4th_last_part STRING,
url_host_5th_last_part STRING,
url_host_registry_suffix STRING,
url_host_registered_domain STRING,
url_host_private_suffix STRING,
url_host_private_domain STRING,
url_protocol STRING,
url_port INT,
url_path STRING,
url_query STRING,
fetch_time TIMESTAMP,
fetch_status SMALLINT,
content_digest STRING,
content_mime_type STRING,
content_mime_detected STRING,
content_charset STRING,
content_languages STRING,
warc_filename STRING,
warc_record_offset INT,
warc_record_length INT,
warc_segment STRING)
PARTITIONED BY (
crawl STRING,
subset STRING)
STORED AS parquet
LOCATION 's3://commoncrawl/cc-index/table/cc-main/warc/';
MSCK REPAIR TABLE ccindex
select url
from ccindex
where crawl = 'CC-MAIN-2024-38'
and subset = 'warc'
and url_host_tld = 'de';
Upvotes: 1