kartheek7895

Reputation: 351

Retrieving records from WARC file based on url

I have to retrieve records from a *.warc.gz file based on their WARC-Target-URI. The documentation says that this requires external CDXJ index files to be created.

I've tried opening the file with gzip.open() and doing a seek(offset), but the seek operation takes quite some time (seconds).

Is there a better way to retrieve the records?

Edit: I'm using the warc Python library, and it doesn't seem to provide a direct f.seek() on the WARC file.

Upvotes: 2

Views: 2006

Answers (1)

Sebastian Nagel

Reputation: 2239

You should do the seek on the file before decompressing it. A seek on a gzip.open() handle is slow because it has to decompress everything up to the target position. WARC files are usually compressed record by record, so the offset and length in the CDXJ index let you clip a single WARC record out of the raw file and then decompress just that record with gzip. When in doubt, use a library. warcio even provides a command-line tool to extract a single record by offset: warcio extract xyz.warc.gz offset.
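As a rough illustration of the "seek first, decompress afterwards" idea, here is a minimal sketch using only the Python standard library. It assumes the WARC file is compressed record by record and that offset and length are the values taken from the matching CDXJ entry. The function name read_warc_record is made up for this example and is not part of any library.

```python
import gzip


def read_warc_record(path, offset, length):
    """Return the decompressed bytes of a single WARC record.

    Hypothetical helper: `offset` and `length` are assumed to come from
    the CDXJ index entry for the record, and the file is assumed to be
    gzip-compressed record by record (one gzip member per record).
    """
    # Open the file as plain binary, NOT with gzip.open(), so that the
    # seek operates on compressed bytes and is cheap.
    with open(path, "rb") as raw:
        raw.seek(offset)
        member = raw.read(length)  # the compressed bytes of one record
    # Decompress only the clipped-out record.
    return gzip.decompress(member)
```

The returned bytes contain the WARC headers followed by the record payload, so they still need to be parsed (by hand or with a WARC library) to check WARC-Target-URI or to get at the HTTP response.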

Upvotes: 3
