Grevioos

Reputation: 405

How to read a CSV file with an additional comma inside quotes using PySpark?

I am having some trouble reading the following CSV data in UTF-16:

FullName, FullLabel, Type
TEST.slice, "Consideration":"Verde (Spar Verde, Fonte Verde)", Test,

As far as I understand, this should not be a problem for the reader, since there is a quote parameter to handle that:

df = spark.read.csv(file_path, header=True, encoding='UTF-16', quote='"')

However, this would still give me an incorrect split:

(screenshot of the resulting DataFrame: the quoted FullLabel value is split incorrectly across columns)

Is there some way to handle such cases, or do I need to work around it with an RDD?

Thank you in advance.

Upvotes: 1

Views: 2484

Answers (2)

vijayinani

Reputation: 2634

Just another option, in case you find it simpler:

First, read the text file as an RDD, replace ":" with ~:~, and save the result as a text file.

# tempPath is any writable scratch location
sc.textFile(file_path).map(lambda x: x.replace('":"', '~:~')).saveAsTextFile(tempPath)

Next, read the temp path back and replace ~:~ with ":" again, but this time as a DataFrame.

from pyspark.sql import functions as F
spark.read.option('header', 'true').csv(tempPath) \
    .withColumn('FullLabel', F.regexp_replace(F.col('FullLabel'), '~:~', '":"')) \
    .show(1, False)

+----------+-----------------------------------------------+----+
|FullName  |FullLabel                                      |Type|
+----------+-----------------------------------------------+----+
|TEST.slice|Consideration":"Verde (Spar Verde, Fonte Verde)|Test|
+----------+-----------------------------------------------+----+
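
For reference, here are the same two steps as one self-contained sketch. tempPath is a hypothetical scratch location, and note that sc.textFile decodes as UTF-8, so a UTF-16 source would need converting first:

from pyspark.sql import functions as F

tempPath = "/tmp/fulllabel_fix"  # hypothetical scratch location

# Step 1: mask the embedded ":" so the field reads as one quoted token.
sc.textFile(file_path).map(lambda x: x.replace('":"', '~:~')).saveAsTextFile(tempPath)

# Step 2: parse the masked file as CSV, then restore the original characters.
df = (spark.read.option('header', 'true').csv(tempPath)
      .withColumn('FullLabel', F.regexp_replace(F.col('FullLabel'), '~:~', '":"')))
df.show(1, False)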

Upvotes: 1

blackbishop

Reputation: 32660

You can read the file as text using spark.read.text and split the values with a regex that splits on commas but ignores those inside quotes (you can see this post), then get the corresponding columns from the resulting array:

from pyspark.sql import functions as F

df = spark.read.text(file_path)

# Drop the header row, then split on commas that fall outside quotes.
df = df.filter("value != 'FullName, FullLabel, Type'") \
    .withColumn(
        "value",
        F.split(F.col("value"), ',(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)')
    ).select(
        F.col("value")[0].alias("FullName"),
        F.col("value")[1].alias("FullLabel"),
        F.col("value")[2].alias("Type")
    )

df.show(truncate=False)

#+----------+--------------------------------------------------+-----+
#|FullName  |FullLabel                                         |Type |
#+----------+--------------------------------------------------+-----+
#|TEST.slice| "Consideration":"Verde (Spar Verde, Fonte Verde)"| Test|
#+----------+--------------------------------------------------+-----+
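
To see what that lookahead does, here is a quick, purely illustrative check with Python's re module on the sample row (Spark's split uses Java regex, but this pattern behaves the same in both):

import re

line = 'TEST.slice, "Consideration":"Verde (Spar Verde, Fonte Verde)", Test,'
# Split only on commas followed by an even number of double quotes
# before the end of the line, i.e. commas that are not inside quotes.
pattern = r',(?=(?:[^"]*"[^"]*")*[^"]*$)'
print(re.split(pattern, line))
# ['TEST.slice', ' "Consideration":"Verde (Spar Verde, Fonte Verde)"', ' Test', '']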

Update:

For an input file in UTF-16, you can replace spark.read.text by loading the file with sc.binaryFiles and then converting the resulting RDD into a DataFrame:

# binaryFiles yields one (path, content) record per file; decode the
# bytes as UTF-16 and emit one row per line.
df = sc.binaryFiles(file_path) \
    .flatMap(lambda x: [[l] for l in x[1].decode("utf-16").split("\n")]) \
    .toDF(["value"])
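
Note that sc.binaryFiles reads each file as a single (path, bytes) record, so the whole file is decoded at once. A small local check of the decode-and-split step (illustrative only, with a made-up payload):

# Mimic the flatMap body on an in-memory UTF-16 payload.
payload = 'FullName, FullLabel, Type\nTEST.slice, "a, b", Test'.encode("utf-16")
rows = [[line] for line in payload.decode("utf-16").split("\n")]
print(rows)  # [['FullName, FullLabel, Type'], ['TEST.slice, "a, b", Test']]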

Upvotes: 3
