Reputation: 760
It seems Spark is not able to handle escape characters in CSV fields that are not enclosed in quotes. For example:
Name,Age,Address,Salary
Luke,24,Mountain View\,CA,100
I am using PySpark; the following code apparently won't work with the comma inside the Address field:
df = spark.read.csv(fname, schema=given_schema,
                    sep=',', quote='', mode="FAILFAST")
Any suggestions?
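For reference, here is a minimal reproduction of the failure, assuming Spark 2.x; the file name people.csv and the schema below are illustrative stand-ins for fname and given_schema:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

# Illustrative schema matching the four-column sample above
given_schema = StructType([
    StructField("Name", StringType()),
    StructField("Age", IntegerType()),
    StructField("Address", StringType()),
    StructField("Salary", IntegerType()),
])

# With quote='' the backslash is not treated as an escape, so
# "Mountain View\,CA" splits into two fields, the row yields five values
# for a four-column schema, and mode="FAILFAST" raises on that record.
df = spark.read.csv("people.csv", schema=given_schema,
                    sep=',', quote='', mode="FAILFAST")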
Upvotes: 1
Views: 1519
Reputation: 4674
Could you try reading the file as an RDD first, reformatting it, and then creating a DataFrame over it?
import csv

# Replace the escaped comma with a "|" placeholder, parse each partition
# with the csv module, and drop the header row before building the DataFrame.
df = sc.textFile(PATH_TO_FILE) \
    .map(lambda x: x.replace("\\,", "|")) \
    .mapPartitions(lambda lines: csv.reader(lines, delimiter=',')) \
    .filter(lambda line: line[0] != 'Name') \
    .toDF(['Name', 'Age', 'Address', 'Salary'])
This is how your DataFrame looks now:
>>> df.show();
+----+---+----------------+------+
|Name|Age| Address|Salary|
+----+---+----------------+------+
|Luke| 24|Mountain View|CA| 100|
+----+---+----------------+------+
I replaced the escaped comma ("\,") in the Address column with "|" and then split the data on the ',' delimiter. I'm not sure whether it matches your requirement exactly, but it works.
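If you need the original comma back in the Address column, a small follow-up sketch (assuming the DataFrame built above) can swap the "|" placeholder out again:

from pyspark.sql.functions import regexp_replace

# Turn the temporary "|" placeholder back into the original comma,
# e.g. "Mountain View|CA" becomes "Mountain View,CA"
df = df.withColumn("Address", regexp_replace("Address", "\\|", ","))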
Upvotes: 2