Lwica Gorska

Reputation: 89

Spark Read csv with missing quotes


val data = spark.read
      .option("delimiter", "\t")
      .option("quote", "\"")
      .csv("file:///opt/spark/test1.tsv")

incorrectly interprets lines with an unclosed quote, even though the tab delimiter is present. For example, the line:

"aaa" \t "b'bb \t 222 

is interpreted as "aaa", "b'bb 222" instead of

"aaa", "b'bb", "222"

According to the documentation, delimiters inside quotes are ignored. I can work around the problem by redefining the default quote character, for example .option("quote", "+"), but it's not a good solution.
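The same unclosed-quote behavior is easy to reproduce outside Spark. A minimal sketch using Python's standard csv module (the input line is made up to match the example above):

```python
import csv

# A tab-separated line with an unclosed quote in the second field,
# matching the example above.
line = '"aaa"\t"b\'bb\t222'

# With quoting enabled, the parser treats everything after the
# unclosed quote as part of the second field, swallowing the tab.
with_quotes = next(csv.reader([line], delimiter='\t', quotechar='"'))
print(with_quotes)   # two fields: ['aaa', "b'bb\t222"]

# With quoting disabled (analogous to .option("quote", "")),
# the line splits on every tab and keeps the literal quote characters.
no_quotes = next(csv.reader([line], delimiter='\t', quoting=csv.QUOTE_NONE))
print(no_quotes)     # three fields: ['"aaa"', '"b\'bb', '222']
```

This shows the trade-off: honoring the quote character merges the fields, while disabling quoting preserves the field count at the cost of leaving the quote characters in the data.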

Upvotes: 2

Views: 492

Answers (1)

Gaurang Shah

Reputation: 12910

If the quotes are not closed properly, the only option is to keep them when creating the DataFrame and drop them later with custom logic.

scala> val df = spark.read.option("delimiter", "\t").option("quote", "").csv("test.csv")
scala> df.show()
+-----+-----+---+
|  _c0|  _c1|_c2|
+-----+-----+---+
|"aaa"|"b'bb|222|
+-----+-----+---+

Now, if you know which column might have the issue, just apply the following logic.

scala> df.withColumn("col_without_quotes", regexp_replace($"_c0","\"","")).show()
+-----+-----+---+------------------+
|  _c0|  _c1|_c2|col_without_quotes|
+-----+-----+---+------------------+
|"aaa"|"b'bb| 22|               aaa|
+-----+-----+---+------------------+
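The cleanup step can also be expressed outside Spark. A plain-Python sketch of the same logic as regexp_replace (the helper name strip_quotes is made up for illustration):

```python
import re

def strip_quotes(value: str) -> str:
    # Drop every double-quote character, mirroring
    # regexp_replace(col, "\"", "") in the answer above.
    return re.sub('"', '', value)

# Applied to the values from the DataFrame above:
print(strip_quotes('"aaa"'))   # aaa
print(strip_quotes('"b\'bb'))  # b'bb
```

In Spark you would wrap this in withColumn per affected column, as the answer shows; values without quotes pass through unchanged.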

Upvotes: 2
