user3383348

Reputation: 87

Open a WARC file with Python

I'm trying to open a WARC file with Python using the warc library documented at http://warc.readthedocs.org/en/latest/

When opening the file with:

import warc
f = warc.open("00.warc.gz")

Everything works, and the f object is:

<warc.warc.WARCFile instance at 0x1151d34d0>

However, when I try to iterate over every record in the file with:

for record in f:
    print record['WARC-Target-URI'], record['Content-Length']

The following error appears:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/xxx/anaconda/lib/python2.7/site-packages/warc/warc.py", line 390, in __iter__
    record = self.read_record()
  File "/Users/xxx/anaconda/lib/python2.7/site-packages/warc/warc.py", line 373, in read_record
    header = self.read_header(fileobj)
  File "/Users/xxx/anaconda/lib/python2.7/site-packages/warc/warc.py", line 331, in read_header
    raise IOError("Bad version line: %r" % version_line)
IOError: Bad version line: 'WARC/0.18\n'

Is this because the warc library I'm using does not support my WARC file's version, or is something else going on?

Upvotes: 6

Views: 2799

Answers (2)

yomin

Reputation: 551

Yes. Thanks to @eyelash for explaining this problem.

Some records in ClueWeb09 are indeed malformed. However, both the official warc library and the warc-clueweb fork recommended in @eyelash's answer have some issues.

The fork cannot handle the ClueWeb12 dataset, and it can also miss one or two documents in each .warc.gz file it processes.

So I changed the code slightly to support both the ClueWeb09 and ClueWeb12 datasets. Here is my repo, which has been tested on 100 billion pages. My warc tools are forked from the warc-clueweb library and the official repo, with modifications.

Upvotes: 0

eyelash

Reputation: 76

The ClueWeb09 dataset is distributed in the WARC 0.18 format. However, it has several issues, and some records are malformed.

The most prevalent problem is an extra newline in the WARC header. There are also a few cases of other malformed headers.

Moreover, it does not use the standard \r\n end-of-line markers, which is the actual cause of your error: the version line in your traceback ends in \n alone.
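
To check this on your own file, here is a minimal sketch (Python 3, standard library only) that reads the first line of a gzip-compressed WARC file and reports how it is terminated. The two regular expressions and the names first_line and diagnose are illustrative assumptions, not the warc library's actual code. The strict pattern only mimics a parser that insists on \r\n.

```python
import gzip
import io
import re

# Assumption: a strict parser expects the version line to end in CRLF.
STRICT_VERSION = re.compile(rb"WARC/(\d+\.\d+)\r\n")
# Assumption: a lenient parser would also accept a bare LF.
LENIENT_VERSION = re.compile(rb"WARC/(\d+\.\d+)\r?\n")


def first_line(path_or_file):
    """Return the first line of a gzip-compressed file as raw bytes."""
    with gzip.open(path_or_file, "rb") as fh:
        return fh.readline()


def diagnose(version_line):
    """Classify a WARC version line by how it is terminated."""
    if STRICT_VERSION.fullmatch(version_line):
        return "ok"          # WARC/x.y followed by \r\n
    if LENIENT_VERSION.fullmatch(version_line):
        return "bare-lf"     # WARC/x.y followed by \n only
    return "malformed"       # anything else


if __name__ == "__main__":
    # In-memory stand-in for a file whose version line ends in a bare \n.
    sample = io.BytesIO(gzip.compress(b"WARC/0.18\nWARC-Type: warcinfo\n"))
    print(diagnose(first_line(sample)))
```

Called on a file like the one in the question, diagnose(first_line("00.warc.gz")) would be expected to return "bare-lf", since the traceback shows the version line ending in \n alone.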

The warc-clueweb library can handle this. It is a Python library written specifically for working with ClueWeb09 WARC files. According to its documentation:

"Only minor modifications to the original library were made. The original documentation of the warc library still holds."

Upvotes: 6
