Reputation: 313
I have a CSV that is not quoted; I have added an example below.
New lines are escaped with \, as shown in the 2nd row. Is there a way to replace that with some other character using Apache Spark?
Input CSV
Banana,23,Male,5,11,2017
Cat,32,Fe\
male,2,11,2017
Dragon,28,Male,1,11,2017
Expected Output
Banana,23,Male,5,11,2017
Cat,32,Fe-male,2,11,2017
Dragon,28,Male,1,11,2017
Note: the original file is huge (around 40 GB)
Edit 1: I just found an answer suggesting "sc.wholeTextFiles" instead of "sc.textFile", but given the file's size I'm not sure it is memory efficient. Please advise.
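For reference, the merging itself can be done outside Spark in a single streaming pass (a sketch, not from the accepted answer; `merge_escaped_lines` and its `sep` parameter are names I made up). Because it processes one line at a time, memory use stays constant no matter how large the file is:

```python
def merge_escaped_lines(lines, sep="-"):
    """Yield logical rows, joining any line that ends in a backslash
    to the next physical line with `sep`. Lazy, so memory use is
    constant regardless of file size."""
    buf = ""
    for line in lines:
        line = line.rstrip("\n")
        if line.endswith("\\"):
            buf += line[:-1] + sep   # drop the backslash, insert the marker
        else:
            yield buf + line
            buf = ""
    if buf:                          # trailing continuation with no next line
        yield buf

rows = list(merge_escaped_lines([
    "Banana,23,Male,5,11,2017",
    "Cat,32,Fe\\",
    "male,2,11,2017",
    "Dragon,28,Male,1,11,2017",
]))
# rows == ['Banana,23,Male,5,11,2017', 'Cat,32,Fe-male,2,11,2017',
#          'Dragon,28,Male,1,11,2017']
```

This assumes a continuation is always a single trailing backslash; a line ending in an escaped backslash (`\\`) would need extra handling.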
Upvotes: 1
Views: 5612
Reputation: 313
After some research and playing around, this is what I came up with.
As @vikrant-rana suggested in the answer, reading with sc.textFile() and doing a map over the partitions is one approach, but since the two lines that need to be merged may land in different partitions, it is not a reliable solution. It may work when they happen to fall in the same partition, but it will not work always.
Alternatively, we can use sc.wholeTextFiles() to read the file into a single partition and map over that, but it reads the whole file into memory at once and is therefore not suitable for huge files.
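For completeness, a sketch of that wholeTextFiles() variant (only viable when a single file fits in an executor's memory; `fix_file` is a name I made up, and `PATH_TO_FILE` is a placeholder):

```python
def fix_file(pair):
    """wholeTextFiles yields (filename, whole file as one string);
    merge escaped line breaks and split back into rows."""
    _, content = pair
    return content.replace("\\\n", "-").splitlines()

# In a pyspark shell, where `sc` is the SparkContext:
# rdd = sc.wholeTextFiles(PATH_TO_FILE).flatMap(fix_file)
```

Since each file arrives as one string, a continuation can never be split across partitions, which is exactly the correctness guarantee the mapPartitions approach lacks.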
Upvotes: 2
Reputation: 4674
Here you go!
Python function:
def my_func(lista):
    new = "\n".join(lista).replace("\\\n", "-").splitlines()
    return new
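As a quick sanity check outside Spark, the function can be exercised on a plain list (the sample rows are the ones from the question; the function is repeated here so the snippet is self-contained):

```python
def my_func(lista):
    # Rejoin the partition's lines, merge the escaped break, re-split.
    new = "\n".join(lista).replace("\\\n", "-").splitlines()
    return new

merged = my_func(["Cat,32,Fe\\", "male,2,11,2017"])
# merged == ['Cat,32,Fe-male,2,11,2017']
```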
Call this function for your rdd partitions:
>>> newrdd = sc.textFile(PATH_TO_FILE).mapPartitions(my_func).map(lambda line : line.split(","))
>>> newrdd.take(5);
[[u'Banana', u'23', u'Male', u'5', u'11', u'2017'], [u'Cat', u'32', u'Fe-male', u'2', u'11', u'2017'], [u'Dragon', u'28', u'Male', u'1', u'11', u'2017']]
newdf = sc.textFile(PATH_TO_FILE).mapPartitions(my_func).map(lambda line : line.split(",")).toDF(['Col1','Col2','Col3','Col4','Col5','Col6'])
>>> newdf.show();
+------+----+-------+----+----+----+
|  Col1|Col2|   Col3|Col4|Col5|Col6|
+------+----+-------+----+----+----+
|Banana|  23|   Male|   5|  11|2017|
|   Cat|  32|Fe-male|   2|  11|2017|
|Dragon|  28|   Male|   1|  11|2017|
+------+----+-------+----+----+----+
I guess you are expecting a solution similar to this. Let me know if you were looking for something else; I can tweak my function as per the requirements :-)
Upvotes: 2