Reputation: 860
The file is 20 GB and the line-ending character is ␀ (NUL). Below is the PySpark code:
text_file = sc.textFile(file_name)
counts = text_file.flatMap(lambda line: line.split("␀"))
counts.count()
This raises the error: Too many bytes before newline: 2147483648
Question: How can I read a large file with a custom line-ending character in PySpark?
Upvotes: 0
Views: 661
Reputation: 35229
You can use the same technique as in creating a Spark data structure from a multiline record:
rdd = sc.newAPIHadoopFile(
    '/tmp/weird',
    'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'org.apache.hadoop.io.Text',
    # If ␀ stands for an actual NUL byte, pass '\0' here instead
    # of the U+2400 symbol.
    conf={'textinputformat.record.delimiter': '␀'}
).values()
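To see what textinputformat.record.delimiter buys you, here is a plain-Python sketch (no Spark required) of the same idea: splitting a byte stream on NUL instead of '\n'. The sample data is made up, and this assumes the ␀ in the question stands for an actual NUL byte:

```python
# Sample byte stream using NUL ('\x00') as the record delimiter,
# analogous to a file with ␀ line endings.
data = b"first record\x00second record\x00third record\x00"

# Split on the delimiter; a trailing delimiter leaves an empty
# final piece, which we drop (Hadoop's reader does the same).
records = [chunk.decode("utf-8") for chunk in data.split(b"\x00") if chunk]

print(len(records))
print(records[0])
```

With newAPIHadoopFile, each such record arrives as one element of the RDD, so rdd.count() on the answer's code plays the role of len(records) here.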
Upvotes: 1