lorenzofeliz

Reputation: 607

How can one extract every payload from warc.wet.gz?

I have been trying to extract the text data from Common Crawl's WET files. I am currently using the warc parser from the Internet Archive: https://github.com/internetarchive/warc

import warc

w = warc.open(fileName)
for record in w:
    text = record.payload.read()

But this method returns less than half of the data that is in the payloads. Is there a better method that can extract all of the data in every payload of a file?

Upvotes: 3

Views: 2901

Answers (2)

Chuck_Berry

Reputation: 29
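
This switches to the warcio library. The sketch below streams the .warc.wet.gz file through requests (mounting requests_file's FileAdapter so a local file:// path can be fetched like a URL), counts the records, and writes the payload of each text/plain conversion record to its own .txt file, up to num files, keeping a URL-to-filename mapping in f_out: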

import requests
from requests_file import FileAdapter
from warcio.archiveiterator import ArchiveIterator

def print_f_records(fpath, f_out, num):
    record_count = 0
    conversion_count = 0
    header_count = 0
    splitter = "://"
    mx = num
    # mount a file:// adapter so a local file can be streamed like a URL
    s = requests.session()
    s.mount('file://', FileAdapter())
    resp = s.get(fpath, stream=True)
    for record in ArchiveIterator(resp.raw, arc2warc=True):
        record_count += 1
        if record.rec_type == 'warcinfo':
            print(record.raw_stream.read())
        elif record.rec_type == 'conversion':
            conversion_count += 1
            if record.rec_headers.get_header('Content-Type') == 'text/plain':
                header_count += 1
                if header_count > mx:
                    continue                        # keep counting records, but stop writing files
                print("\n text/plain header no: ", header_count)
                prefix = f"a{header_count}_"        # filenames will begin with a<header_count>
                fname1 = record.rec_headers.get_header('WARC-Target-URI')
                fname = fname1.split(splitter)[1]   # the name part which follows ://
                fname = prefix + fname[:22] + '.txt'    # keep the entire file name to 32 characters or less
                fname = fname.replace("/", "_")
                content = record.content_stream().read()
                with open(fname, 'wb') as f:        # wb: content_stream() returns bytes
                    f.write(content)
                line = f'{fname1},{fname}' + "\n"   # keep a record of the url translations
                f_out.write(line)
                print("created text file: ", fname)
                if header_count == mx:
                    print(f"\n... completed {mx} txt files ... \n")
    print("Number of records: ", record_count)
    print("Number of conversion types: ", conversion_count)
    print("Number of text/plain http_headers: ", header_count)
    print("\nFINISHED")

warc_wet_gz = 'file:////home/psl/CCrawl/text_extraction/text_from_wet/CC-MAIN-20230928063033-20230928093033-00593.warc.wet.gz'
print(f'\n warc.wet.gz file is {warc_wet_gz}')
num = 100
f_out = open("wet_list_out.txt", 'w')
print_f_records(warc_wet_gz, f_out, num)
f_out.close()

Upvotes: -1

Tianya Chen

Reputation: 1

The warc library has a bug in its gzip handling that causes it to fail to read the entire WET file. To work around the bug, use Python's gzip library to decompress the file stream on the fly, as below:

import gzip
import warc

# decompress with Python's gzip module and hand the uncompressed
# stream to warc, bypassing its broken gzip handling;
# wet_file is the path to the .warc.wet.gz file
gzip_fobj = gzip.open(wet_file, "rb")
warc_fobj = warc.WARCFile(fileobj=gzip_fobj, compress=False)
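
With warc_fobj built this way, the loop from the question works unchanged and now yields every record:

for record in warc_fobj:
    text = record.payload.read()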

Upvotes: 0
