Ashraful Islam

Reputation: 12830

Unescape comma when reading CSV with spark

I have a CSV file whose name field contains commas (,) escaped with \:

id,name
"10","Ashraful\, Islam"

I am reading the CSV file with PySpark:

test = (spark.read.format("csv")
    .option("sep", ",")
    .option("escape", "\\")
    .option("inferSchema", "true")
    .option("header", "true")
    .load("test.csv"))
test.show()

The name should be Ashraful, Islam, but I get this output instead:

+---+----------------+
| id|            name|
+---+----------------+
| 10|Ashraful\, Islam|
+---+----------------+

Upvotes: 1

Views: 2907

Answers (1)

Michail N

Reputation: 3835

Simply use:

df = spark.read.csv('file:///mypath.../myFile.csv', sep=',', header=True)
df.show()

This gives the output:

+---+---------------+
| id|           name|
+---+---------------+
| 10|Ashraful, Islam|
+---+---------------+

EDIT: I could not replicate your problem with the input file you provided, but if it persists you can work around it by replacing "\," (or any other escaped special character) in the DataFrame.

You can do:

from pyspark.sql.functions import regexp_replace

df = spark.read.csv('file:///home/perfman/todel.csv', sep=',', header=True)
df.withColumn('nameClean', regexp_replace('name', r'\\,', ',')).show()

>>>
+---+----------------+---------------+
| id|            name|      nameClean|
+---+----------------+---------------+
| 10|Ashraful\, Islam|Ashraful, Islam|
+---+----------------+---------------+
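The same idea generalizes: a pattern like `\\(.)` matches a backslash followed by any character, so replacing it with the captured character strips every backslash escape, not just `\,`. As a quick sanity check of the regex itself (plain Python `re` here, outside Spark, since the pattern semantics are the same):

```python
import re

def unescape(s):
    # Replace "backslash + any char" with just that char,
    # dropping the escaping backslash.
    return re.sub(r'\\(.)', r'\1', s)

print(unescape('Ashraful\\, Islam'))  # Ashraful, Islam
```

Note that Spark's `regexp_replace` uses Java regex syntax, where the backreference in the replacement is written `$1` rather than `\1`, so the hypothetical Spark equivalent would be `regexp_replace('name', r'\\(.)', '$1')`.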

Upvotes: 1
