FrozenSoul90
FrozenSoul90

Reputation: 313

How to replace escaped newline in spark

I have a csv, that is not quoted, have added an example below

New lines are escaped with \, as shown in the 2nd row, is there a way to replace that with some other character using apache spark..

Input CSV

Banana,23,Male,5,11,2017
Cat,32,Fe\
male,2,11,2017
Dragon,28,Male,1,11,2017

Expected Output

Banana,23,Male,5,11,2017
Cat,32,Fe-male,2,11,2017
Dragon,28,Male,1,11,2017

Note: the original file is huge (around 40GB)

Edit 1 I just found an answer to use "sc. wholeTextFiles" instead of "sc.textFile", but given the big size I m not sure if it is memory efficient, please advise

Upvotes: 1

Views: 5612

Answers (2)

FrozenSoul90
FrozenSoul90

Reputation: 313

After some research and playing around this is what i came to

as @vikrant-rana suggested in the answer, reading with sc.textFile() and doing a map on the partitions is one way to try, but as the row we need to merge may go to different partition, this is not a reliable solution.. This may work sometimes when they fall on same partition, but will not work always

we can alternatively to use sc.wholeTextFiles() to read the file into single partition and do map over it, but that would read the whole file at once into the memory and is not suitable for huge files

Upvotes: 2

vikrant rana
vikrant rana

Reputation: 4674

Here you go!

Python function:

def my_func(lista):
    new="\n".join(lista).replace("\\\n", "-").splitlines()
    return new

Call this function for your rdd partitions:

>>> newrdd = sc.textFile(PATH_TO_FILE).mapPartitions(my_func).map(lambda line : line.split(","))
>>> newrdd.take(5);
[[u'Banana', u'23', u'Male', u'5', u'11', u'2017'], [u'Cat', u'32', u'Fe-male', u'2', u'11', u'2017'], [u'Dragon', u'28', u'Male', u'1', u'11', u'2017']]

now loading this input file to the dataframe:

newdf  = sc.textFile(PATH_TO_FILE).mapPartitions(my_func).map(lambda line : line.split(",")).toDF(['Col1','Col2','Col3','Col4','Col5','Col5'])

>>> newdf.show();
+------+----+-------+----+----+----+
|  Col1|Col2|   Col3|Col4|Col5|Col5|
+------+----+-------+----+----+----+
|Banana|  23|   Male|   5|  11|2017|
|   Cat|  32|Fe-male|   2|  11|2017|
|Dragon|  28|   Male|   1|  11|2017|
+------+----+-------+----+----+----+

I guess you are expecting solution similar to this. Let me know if you were looking something else. I can tweak my function as per the requirements :-)

Upvotes: 2

Related Questions