Ravi Ranjan

Reputation: 353

requests.get() does not retrieve all Common Crawl records for a given WARC path

I have implemented the approach described at https://dmorgan.info/posts/common-crawl-python/. However, I want to retrieve all of the records in a WARC file rather than the partial data that the post fetches. The post uses this code:

def get_partial_warc_file(url, num_bytes=1024 * 10):
    with closing(requests.get(url, stream=True)) as r:
        buf = StringIO(r.raw.read(num_bytes))
    return warc.WARCFile(fileobj=buf, compress=True)

I have made the following change:

def get_partial_warc_file(url):
    with closing(requests.get(url, stream=True)) as r:
        buf = StringIO(r.raw.data)
    return warc.WARCFile(fileobj=buf, compress=True)

This change increases the number of records returned for a given WARC path, but it still does not return all of the records in the file. I can't figure out why. Any help would be appreciated.
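
One way to narrow this down might be to check whether the whole body actually arrives before it is handed to the WARC parser. Below is a minimal sketch of such a check. It assumes Python 3 and that the file is a sequence of concatenated gzip members (one per record, as Common Crawl WARC files are generally published). The helper names `read_fully` and `count_gzip_members` are hypothetical, not part of requests or warc.

```python
import gzip
import io
import zlib


def read_fully(stream, chunk_size=64 * 1024):
    """Copy a file-like object into memory chunk by chunk until EOF.

    `stream` is anything with a .read(n) method, for example the
    `r.raw` attribute of a requests response opened with stream=True.
    """
    buf = io.BytesIO()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty read means the stream is exhausted
            break
        buf.write(chunk)
    buf.seek(0)
    return buf


def count_gzip_members(data):
    """Count the complete gzip members in a bytes object.

    Stops counting at the first member that is cut off, so a truncated
    download yields a smaller number than the complete file would.
    """
    count = 0
    while data:
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
        decomp.decompress(data)
        if not decomp.eof:  # member ended before its trailer: truncated
            break
        count += 1
        data = decomp.unused_data  # bytes belonging to the next member
    return count
```

With requests, `read_fully(r.raw)` would take the place of `StringIO(r.raw.data)`. Comparing `len(buf.getvalue())` with the response's `Content-Length` header, and the member count with the number of records the parser yields, should show whether the download or the parsing step is losing records.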

Upvotes: 1

Views: 291

Answers (0)
