Reputation: 843
I need to process a .warc file with Spark, but I can't seem to find a straightforward way of doing so. I would prefer to use Python, and I'd rather not read the whole file into an RDD through wholeTextFiles() (because the whole file would then be processed at a single node(?)), so it seems the only/best way is a custom Hadoop InputFormat used with .hadoopFile() in Python.
However, I could not find an easy way of doing this. Splitting a .warc file into entries is as simple as splitting on \n\n\n, so how can I achieve that without writing a ton of extra (useless) code, as shown in various "tutorials" online? Can it be done entirely in Python?
In other words: how can I split a warc file into entries without reading the whole thing with wholeTextFiles?
Upvotes: 3
Views: 720
Reputation: 121
If the delimiter is \n\n\n, you can use textinputformat.record.delimiter, which tells Hadoop's TextInputFormat to break the input into records at that string instead of at single newlines:
rdd = sc.newAPIHadoopFile(
    path,
    'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'org.apache.hadoop.io.Text',
    conf={'textinputformat.record.delimiter': '\n\n\n'}
)
Upvotes: 3