Dimebag

Reputation: 843

Spark reading WARC file with custom InputFormat

I need to process a .warc file with Spark, but I can't find a straightforward way of doing so. I would prefer to use Python, and to not read the whole file into an RDD through wholeTextFiles() (because the whole file would then be processed at a single node(?)), so the only/best way seems to be a custom Hadoop InputFormat used with .hadoopFile() in Python.

However, I could not find an easy way of doing this. Splitting a .warc file into entries is as simple as splitting on \n\n\n, so how can I achieve this without writing a ton of extra (useless) code, as shown in various "tutorials" online? Can it all be done in Python?

i.e., how can I split a WARC file into entries without reading the whole thing with wholeTextFiles()?

Upvotes: 3

Views: 720

Answers (1)

9b428a28

Reputation: 121

If the delimiter is \n\n\n, you can use textinputformat.record.delimiter:

sc.newAPIHadoopFile(
    path,
    'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'org.apache.hadoop.io.Text',
    conf={'textinputformat.record.delimiter': '\n\n\n'}
)
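The resulting RDD contains (offset, text) pairs, so you usually want to keep just the values. A minimal usage sketch, assuming sc is an existing SparkContext and path points at your .warc file (both placeholders here):

# Sketch only: `sc` is assumed to be an existing SparkContext and
# `path` a string pointing at the .warc file.
records = sc.newAPIHadoopFile(
    path,
    'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'org.apache.hadoop.io.Text',
    conf={'textinputformat.record.delimiter': '\n\n\n'}
)

# Each element is a (byte offset, record text) pair; keep just the text
# and drop any empty records produced by trailing delimiters.
entries = records.map(lambda kv: kv[1]).filter(lambda rec: rec.strip())

Because the splitting happens inside the Hadoop input format, the records are read in parallel across partitions rather than on a single node, which is what wholeTextFiles() would do.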

Upvotes: 3
