Jack P

Reputation: 1

Converting warc.gz to .warc

I extracted a warc.gz file with gzip, which produced a WARC file, but it won't load in http://replayweb.page.

Extracting it with The Unarchiver instead gave me all the expanded HTML and other files rather than a single WARC.

What is the latest recommended method for converting warc.gz to warc? For some reason I am coming up short in my attempts to find suggestions for this simple task.

Thanks!

Upvotes: 0

Views: 778

Answers (2)

Shrinks99

Reputation: 3

After trying warc2warc without success, I wrote the following small Python script to accomplish this task. It seems to work reasonably well!

Usage: python warcgz-to-warc compressed.warc.gz -o output.warc

import argparse
import gzip
import shutil
import os

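def convert_warc(input_file_path, output_file_path=None):
    """Decompress a gzip-compressed WARC file to a plain WARC file."""
    # Default output name: the input path with its final extension (".gz") removed.
    if output_file_path is None:
        output_file_path = os.path.splitext(input_file_path)[0]

    # Stream the decompressed bytes to disk without loading the whole file into memory.
    with gzip.open(input_file_path, 'rb') as f_in:
        with open(output_file_path, 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)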

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Convert a WARC file compressed with gzip to a WARC file.')
    parser.add_argument('input_file_path', help='The path to the input WARC file.')
    parser.add_argument('-o', '--output_file_path', help='The path to the output WARC file. If not provided, the output file will have the same name as the input file with the ".gz" extension removed.')

    args = parser.parse_args()

    convert_warc(args.input_file_path, args.output_file_path)
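Since the question mentions that a gzip-extracted file would not load, it may be worth sanity-checking the output of the script above. The following is only a minimal sketch: the helper name `looks_like_warc` is hypothetical, and it checks nothing beyond whether the file begins with a WARC version line (`WARC/...`), so a passing check does not guarantee the file will replay.

```python
import gzip


def looks_like_warc(path):
    """Return True if the file at `path` starts with a WARC version line.

    Hypothetical helper for illustration; it only inspects the first bytes.
    """
    # Read through gzip for ".gz" paths, otherwise read the raw bytes.
    opener = gzip.open if path.endswith('.gz') else open
    with opener(path, 'rb') as f:
        # Every WARC record begins with a version line such as "WARC/1.0".
        return f.read(5) == b'WARC/'
```

For example, `looks_like_warc('output.warc')` should return True for a file produced by the script, assuming the input really was a gzip-compressed WARC.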

Upvotes: 0

Ivan Begtin

Reputation: 29

The programmatic way is to use the "warcio" Python library; the command-line way is to use the "warc2warc" utility from warctools.
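For the warcio route, a rough sketch might look like the following. It assumes warcio is installed (`pip install warcio`); the function name `rewrite_uncompressed` is made up for illustration. It reads each record with `ArchiveIterator` and writes it back out with `WARCWriter(..., gzip=False)`, so the result is rewritten record by record rather than being a plain gunzip of the bytes.

```python
def rewrite_uncompressed(input_path, output_path):
    """Copy every record from a WARC (gzipped or not) into an uncompressed WARC.

    Hypothetical helper for illustration. Returns the number of records written.
    """
    # Imported here so this module still loads when warcio is not installed.
    from warcio.archiveiterator import ArchiveIterator
    from warcio.warcwriter import WARCWriter

    count = 0
    with open(input_path, 'rb') as f_in, open(output_path, 'wb') as f_out:
        # gzip=False makes the writer emit plain, uncompressed records.
        writer = WARCWriter(f_out, gzip=False)
        for record in ArchiveIterator(f_in):
            writer.write_record(record)
            count += 1
    return count
```

Usage would be something like `rewrite_uncompressed('compressed.warc.gz', 'output.warc')`. Whether the resulting file loads in a given replay tool is not something this sketch can promise.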

Upvotes: 0
