Reputation: 353

Converting a warc.gz file downloaded from Common Crawl to an RDD

I have downloaded a warc.gz file from Common Crawl and I have to process it using Spark. How can I convert the file into an RDD? sc.textFile("filepath") does not seem to help: when I print rdd.take(1), it gives me [u'WARC/1.0'], whereas I expected an entire record. How can I convert the file into a processable RDD? Thanks!

Upvotes: 0

Answers (1)

Rishabh Wadhawan

Reputation: 21

You are getting that because sc.textFile treats the file as unstructured, line-oriented text: every line becomes one element of the RDD, so the WARC record structure is gone. rdd.take(1) therefore returns only the first line of the file, which is why you see [u'WARC/1.0']. If you want to process whole WARC records, I wouldn't recommend reading the file with Spark directly, as there is no built-in support for WARC files yet. The Python warc library should help you out here, since it preserves the record structure of your WARC data.
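To illustrate what "preserving the record structure" means, here is a minimal sketch that uses only the Python standard library rather than the warc library. It assumes each record starts with a WARC/ version line, followed by "Name: value" headers, a blank line, and a payload of Content-Length bytes. The function name iter_warc_records is hypothetical, and this is not a full WARC parser.

```python
import gzip


def iter_warc_records(path):
    """Yield (headers, payload) for each record in a .warc.gz file.

    Hypothetical helper: a simplified reader, not a complete WARC parser.
    """
    # gzip.open reads concatenated gzip members as one continuous stream.
    with gzip.open(path, "rb") as stream:
        while True:
            line = stream.readline()
            if not line:
                return  # end of file
            if not line.strip():
                continue  # blank separator lines between records
            version = line.strip().decode("utf-8", "replace")
            if not version.startswith("WARC/"):
                raise ValueError("unexpected line: %r" % line)

            # Header lines run until the first blank line.
            headers = {"version": version}
            while True:
                line = stream.readline()
                if not line.strip():
                    break
                name, _, value = line.decode("utf-8", "replace").partition(":")
                headers[name.strip()] = value.strip()

            # The payload is exactly Content-Length bytes, so newlines inside
            # it do not split the record the way sc.textFile does.
            length = int(headers.get("Content-Length", 0))
            payload = stream.read(length)
            yield headers, payload
```

If you still want an RDD afterwards, one option for a file small enough to fit in driver memory might be sc.parallelize(list(iter_warc_records("filepath"))), which would give you one element per record instead of one per line.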

Upvotes: 2
