Reputation: 860
The file is 20 GB and the line-ending character is ␀ (NUL). Below is the PySpark code:
text_file = sc.textFile(file_name)
counts = text_file.flatMap(lambda line: line.split("␀"))
counts.count()
This raises the error: Too many bytes before newline: 2147483648
Question: How can I read a large file with a custom line-ending character in PySpark?
Upvotes: 0
Views: 661
Reputation: 35229
You can use the same technique as in creating a Spark data structure from a multiline record:
rdd = sc.newAPIHadoopFile(
    '/tmp/weird',
    'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'org.apache.hadoop.io.Text',
    # If ␀ stands for an actual NUL byte, pass '\0' here instead
    # of the U+2400 symbol.
    conf={'textinputformat.record.delimiter': '␀'}
).values()
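To see what textinputformat.record.delimiter buys you, here is a plain-Python sketch (no Spark required) of the same idea: splitting a byte stream on NUL instead of '\n'. The sample data is made up, and this assumes the ␀ in the question stands for an actual NUL byte:

```python
# Sample byte stream using NUL ('\x00') as the record delimiter,
# analogous to a file with ␀ line endings.
data = b"first record\x00second record\x00third record\x00"

# Split on the delimiter; a trailing delimiter leaves an empty
# final piece, which we drop (Hadoop's reader does the same).
records = [chunk.decode("utf-8") for chunk in data.split(b"\x00") if chunk]

print(len(records))
print(records[0])
```

With newAPIHadoopFile, each such record arrives as one element of the RDD, so rdd.count() on the answer's code plays the role of len(records) here.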
Upvotes: 1