Dimebag

Reputation: 843

Spark reading WARC file with custom InputFormat

I need to process a .warc file with Spark, but I can't find a straightforward way of doing so. I would prefer to use Python, and to not read the whole file into an RDD through wholeTextFiles() (because the whole file would then be processed at a single node(?)), so the only/best way seems to be a custom Hadoop InputFormat used with .hadoopFile() in Python.

However, I could not find an easy way of doing this. Splitting a .warc file into entries is as simple as splitting on \n\n\n, so how can I achieve this without writing a ton of extra (useless) code, as shown in various "tutorials" online? Can it all be done in Python?

i.e., how can I split a WARC file into entries without reading the whole thing with wholeTextFiles()?

Upvotes: 3

Views: 720

Answers (1)

9b428a28

Reputation: 121

If the delimiter is \n\n\n, you can use textinputformat.record.delimiter:

sc.newAPIHadoopFile(
    path,
    'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'org.apache.hadoop.io.Text',
    conf={'textinputformat.record.delimiter': '\n\n\n'}
)
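The resulting RDD contains (offset, text) pairs, so you usually want to keep just the values. A minimal usage sketch, assuming sc is an existing SparkContext and path points at your .warc file (both placeholders here):

# Sketch only: `sc` is assumed to be an existing SparkContext and
# `path` a string pointing at the .warc file.
records = sc.newAPIHadoopFile(
    path,
    'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'org.apache.hadoop.io.Text',
    conf={'textinputformat.record.delimiter': '\n\n\n'}
)

# Each element is a (byte offset, record text) pair; keep just the text
# and drop any empty records produced by trailing delimiters.
entries = records.map(lambda kv: kv[1]).filter(lambda rec: rec.strip())

Because the splitting happens inside the Hadoop input format, the records are read in parallel across partitions rather than on a single node, which is what wholeTextFiles() would do.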

Upvotes: 3
