Reputation: 405
I am having some trouble reading the following CSV data in UTF-16:
FullName, FullLabel, Type
TEST.slice, "Consideration":"Verde (Spar Verde, Fonte Verde)", Test,
As far as I understand, this should not be a problem for the reader, since there is a quote parameter to handle exactly that:
df = spark.read.csv(file_path, header=True, encoding='UTF-16', quote='"')
However, this still gives me an incorrect split.
Is there some way to handle such cases, or do I need to work around this with an RDD?
Thank you in advance.
Upvotes: 1
Views: 2484
Reputation: 2634
Just another option, if you find it simpler: first read the file as a text RDD, replace ":" with ~:~, and save the result back out as a text file.
# Swap the embedded ":" for a placeholder the CSV parser won't trip over
sc.textFile(file_path).map(lambda x: x.replace('":"', '~:~')).saveAsTextFile(tempPath)
Next, read tempPath back, this time as a DataFrame, and replace ~:~ with ":" again.
from pyspark.sql import functions as F

# Parse the sanitized file as CSV, then restore the original ":" token
df = spark.read.option('header', 'true').csv(tempPath) \
    .withColumn('FullLabel', F.regexp_replace(F.col('FullLabel'), '~:~', '":"'))
df.show(1, False)
+----------+-----------------------------------------------+----+
|FullName |FullLabel |Type|
+----------+-----------------------------------------------+----+
|TEST.slice|Consideration":"Verde (Spar Verde, Fonte Verde)|Test|
+----------+-----------------------------------------------+----+
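On Spark 2.2+ the temp-file round trip can be skipped, since spark.read.csv also accepts an RDD of strings. A minimal sketch of that variant (my addition; it assumes the file is UTF-8 or already transcoded, as sc.textFile does not decode UTF-16):

from pyspark.sql import functions as F

# Swap the embedded ":" for a placeholder, parse in memory, then swap it back
rdd = sc.textFile(file_path).map(lambda x: x.replace('":"', '~:~'))
df = spark.read.csv(rdd, header=True) \
    .withColumn('FullLabel', F.regexp_replace(F.col('FullLabel'), '~:~', '":"'))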
Upvotes: 1
Reputation: 32660
You can read the file as text using spark.read.text and split the values using a regex that splits on commas but ignores those inside quotes (see this post), then pick the corresponding columns from the resulting array:
from pyspark.sql import functions as F
df = spark.read.text(file_path)
df = df.filter("value != 'FullName, FullLabel, Type'") \
    .withColumn(
        "value",
        # split on commas followed by an even number of quotes,
        # i.e. commas that sit outside any quoted section
        F.split(F.col("value"), ',(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)')
    ).select(
        F.col("value")[0].alias("FullName"),
        F.col("value")[1].alias("FullLabel"),
        F.col("value")[2].alias("Type")
    )
df.show(truncate=False)
#+----------+--------------------------------------------------+-----+
#|FullName |FullLabel |Type |
#+----------+--------------------------------------------------+-----+
#|TEST.slice| "Consideration":"Verde (Spar Verde, Fonte Verde)"| Test|
#+----------+--------------------------------------------------+-----+
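Note that the split keeps the leading space after each delimiter (visible in FullLabel and Type above). If you want clean values, a quick touch-up with F.trim (my addition, not part of the original answer):

df = df.select([F.trim(F.col(c)).alias(c) for c in df.columns])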
Update: for an input file in UTF-16, you can replace spark.read.text by loading the file with sc.binaryFiles and then converting the resulting RDD into a DataFrame:
# binaryFiles yields (path, content) pairs; decode the whole content as
# UTF-16, split it into lines, and wrap each line for a one-column DataFrame
df = sc.binaryFiles(file_path) \
    .flatMap(lambda x: [[l] for l in x[1].decode("utf-16").split("\n")]) \
    .toDF(["value"])
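If the file has Windows line endings (\r\n, common for UTF-16 exports), splitting on "\n" leaves a trailing \r on every line; a variant using splitlines() (my addition) handles both endings:

df = sc.binaryFiles(file_path) \
    .flatMap(lambda x: [[l] for l in x[1].decode("utf-16").splitlines()]) \
    .toDF(["value"])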
Upvotes: 3