Luke

Reputation: 760

Spark to parse backslash escaped comma in CSV files that are not enclosed by quotes

It seems Spark is not able to parse escaped characters in CSV files when the values are not enclosed in quotes. For example:

Name,Age,Address,Salary
Luke,24,Mountain View\,CA,100

I am using PySpark; the following code apparently won't work with the comma inside the Address field.

df = spark.read.csv(fname, schema=given_schema,
                    sep=',', quote='', mode="FAILFAST")

Any suggestions?
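For what it's worth, with quote='' the backslash seems to be treated as ordinary data, so the sample row splits into five fields against a four-column schema, which is presumably what makes FAILFAST raise. A quick illustration of the split in plain Python:

line = "Luke,24,Mountain View\\,CA,100"
# A plain split on ',' yields five fields, one more than
# the four-column schema expects:
print(line.split(','))
# ['Luke', '24', 'Mountain View\\', 'CA', '100']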

Upvotes: 1

Views: 1519

Answers (1)

vikrant rana

Reputation: 4674

Could you please try using an RDD first: reformat the data, then create a DataFrame over it.

import csv

# Swap the escaped commas for "|" so they survive the split, then parse
# each partition with the csv module and drop the header row.
df = sc.textFile(PATH_TO_FILE) \
    .map(lambda x: x.replace("\\,", "|")) \
    .mapPartitions(lambda lines: csv.reader(lines, delimiter=',')) \
    .filter(lambda line: line[0] != 'Name') \
    .toDF(['Name', 'Age', 'Address', 'Salary'])

This is how your DataFrame looks now:

>>> df.show();
+----+---+----------------+------+
|Name|Age|         Address|Salary|
+----+---+----------------+------+
|Luke| 24|Mountain View|CA|   100|
+----+---+----------------+------+

I replaced the "\," in the Address column with "|" and then split the data on the ',' delimiter. Not sure whether this exactly matches your requirement, but it works.
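If the Address column should end up with the original comma rather than the "|" placeholder, a minimal follow-up sketch (assuming the DataFrame built above) could be:

from pyspark.sql import functions as F

# Put the original comma back in Address; "|" was only a
# stand-in while the rows were being split.
df = df.withColumn('Address', F.regexp_replace('Address', '\\|', ','))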

Upvotes: 2
