Reputation: 89
val data = spark.read
  .option("delimiter", "\t")
  .option("quote", "\"")
  .csv("file:///opt/spark/test1.tsv")
incorrectly interprets lines with an unclosed quote, even though the tab delimiter is present. For example, the line:
"aaa" \t "b'bb \t 222
is interpreted as "aaa", "b'bb 222"
instead of
"aaa", "b'bb", "222".
According to the documentation, delimiters inside quotes are ignored.
I can work around the problem by redefining the default quote character, for example:
.option("quote", "+")
but that is not a good solution.
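Spelled out in full, the workaround looks like this (a sketch; `+` stands for any character assumed never to occur in the data):

```scala
// Workaround: redefine the quote character to one that never appears
// in the file, so the dangling `"` is treated as ordinary content
// and the tab delimiter splits the line as expected.
val data = spark.read
  .option("delimiter", "\t")
  .option("quote", "+") // assumption: '+' never occurs in the data
  .csv("file:///opt/spark/test1.tsv")
```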
Upvotes: 2
Views: 492
Reputation: 12910
If the quotes are not closed properly, the only option is to keep them when creating the DataFrame and drop them later with custom logic.
scala> val df = spark.read.option("delimiter", "\t").option("quote", "").csv("test.csv")
scala> df.show()
+-----+-----+---+
|  _c0|  _c1|_c2|
+-----+-----+---+
|"aaa"|"b'bb|222|
+-----+-----+---+
Now, if you know which column might have an issue, just apply the following logic:
scala> df.withColumn("col_without_quotes", regexp_replace($"_c0","\"","")).show()
+-----+-----+---+------------------+
|  _c0|  _c1|_c2|col_without_quotes|
+-----+-----+---+------------------+
|"aaa"|"b'bb|222|               aaa|
+-----+-----+---+------------------+
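If any column may carry stray quotes, the same `regexp_replace` can be folded over all columns instead of naming one. A sketch, assuming the same quote-less read as above (`test.csv` and the default `_c0`/`_c1`/`_c2` names are from that example):

```scala
import org.apache.spark.sql.functions.regexp_replace

// Read with quoting disabled, keeping the literal quote characters.
val df = spark.read
  .option("delimiter", "\t")
  .option("quote", "")
  .csv("test.csv")

// Strip stray double quotes from every column in one pass.
val cleaned = df.columns.foldLeft(df) { (acc, col) =>
  acc.withColumn(col, regexp_replace(acc(col), "\"", ""))
}
cleaned.show()
```

This overwrites each column in place rather than adding `col_without_quotes` columns, which keeps the schema unchanged.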
Upvotes: 2