How to replace escaped newline in spark

Question

I have a CSV file that is not quoted; an example is below.

Newlines inside a field are escaped with a backslash (\), as shown in the 2nd row. Is there a way to replace the escaped newline with some other character using Apache Spark?

Input CSV

Banana,23,Male,5,11,2017
Cat,32,Fe\
male,2,11,2017
Dragon,28,Male,1,11,2017

Expected Output

Banana,23,Male,5,11,2017
Cat,32,Fe-male,2,11,2017
Dragon,28,Male,1,11,2017

Note: the original file is huge (around 40GB)

Edit 1: I just found an answer suggesting "sc.wholeTextFiles" instead of "sc.textFile", but given the file size I'm not sure it is memory efficient. Please advise.

FrozenSoul90 · Accepted Answer

After some research and experimenting, this is what I came to:

As @vikrant-rana suggested in the answer, reading with sc.textFile() and doing a map over the partitions is one way to try. However, the two lines that need to be merged may end up in different partitions, so this is not a reliable solution: it works when both lines fall in the same partition, but not always.
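A minimal sketch of the per-partition merge described above, in Python (PySpark). The helper names merge_escaped_lines and merge_with_spark, and the "-" replacement, are illustrative assumptions rather than code from the original answers. The plain-Python part at the bottom simulates what happens when the two halves of a row land in different partitions.

```python
from typing import Iterable, Iterator


def merge_escaped_lines(lines: Iterable[str], replacement: str = "-") -> Iterator[str]:
    """Join each line ending in a backslash with the line that follows it."""
    pending = None  # accumulated text of a record that is still open
    for line in lines:
        current = line if pending is None else pending + line
        if current.endswith("\\"):
            # Record continues on the next line: swap the backslash for the replacement.
            pending = current[:-1] + replacement
        else:
            pending = None
            yield current
    if pending is not None:
        # The continuation is not in this partition, so the record stays incomplete.
        yield pending


def merge_with_spark(sc, path: str):
    """Hypothetical usage: `sc` is an existing SparkContext, `path` the CSV location."""
    return sc.textFile(path).mapPartitions(merge_escaped_lines)


rows = [
    "Banana,23,Male,5,11,2017",
    "Cat,32,Fe\\",
    "male,2,11,2017",
    "Dragon,28,Male,1,11,2017",
]

# Both halves in one partition: the row is merged correctly.
same_partition = list(merge_escaped_lines(rows))

# Halves split across two partitions: each partition is processed on its own,
# so "Cat,32,Fe-" and "male,2,11,2017" come out as separate, broken rows.
split_partitions = list(merge_escaped_lines(rows[:2])) + list(merge_escaped_lines(rows[2:]))
```

In the simulated split case the output has four rows instead of three, which is the unreliability mentioned above.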

Alternatively, we can use sc.wholeTextFiles() to read the file into a single partition and map over it, but that reads the whole file into memory at once and is not suitable for huge files.
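For smaller files, the sc.wholeTextFiles() approach could look roughly like the sketch below (Python / PySpark). The function names and the "-" replacement are assumptions made for illustration. Because each file arrives as one (path, content) pair, the replacement is a plain string operation, which is also why the whole file has to fit in a single task's memory.

```python
def replace_escaped_newlines(content: str, replacement: str = "-") -> str:
    """Replace every backslash-newline pair with `replacement`."""
    # Handle Windows line endings first so backslash + CRLF is also covered.
    return content.replace("\\\r\n", replacement).replace("\\\n", replacement)


def clean_file_with_spark(sc, path: str):
    """Hypothetical usage: `sc` is an existing SparkContext, `path` the CSV location."""
    # wholeTextFiles yields (file_path, file_content) pairs, one per file,
    # so each file's full content is held in memory by a single task.
    return sc.wholeTextFiles(path).flatMap(
        lambda pair: replace_escaped_newlines(pair[1]).splitlines()
    )


sample = (
    "Banana,23,Male,5,11,2017\n"
    "Cat,32,Fe\\\n"
    "male,2,11,2017\n"
    "Dragon,28,Male,1,11,2017\n"
)
cleaned = replace_escaped_newlines(sample).splitlines()
```

Here cleaned holds the three rows from the expected output, with "Fe\" and "male" joined as "Fe-male".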


Answers (2)

now loading this input file to the dataframe:
