Lwica Gorska

Reputation: 89

Spark Read csv with missing quotes


val data = spark.read
      .option("delimiter", "\t")
      .option("quote", "\"")
      .csv("file:///opt/spark/test1.tsv")

incorrectly interprets lines with an unclosed quote, even though the tab delimiter is present. For example, the line:

"aaa" \t "b'bb \t 222 

is interpreted as "aaa", "b'bb 222" instead of

"aaa", "b'bb", "222"

According to the documentation, delimiters inside quotes are ignored. I can work around the problem by redefining the default quote character, for example .option("quote", "+"), but it's not a good solution.
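The same unclosed-quote behavior is easy to reproduce outside Spark. A minimal sketch using Python's standard csv module (the input line is made up to match the example above):

```python
import csv

# A tab-separated line with an unclosed quote in the second field,
# matching the example above.
line = '"aaa"\t"b\'bb\t222'

# With quoting enabled, the parser treats everything after the
# unclosed quote as part of the second field, swallowing the tab.
with_quotes = next(csv.reader([line], delimiter='\t', quotechar='"'))
print(with_quotes)   # two fields: ['aaa', "b'bb\t222"]

# With quoting disabled (analogous to .option("quote", "")),
# the line splits on every tab and keeps the literal quote characters.
no_quotes = next(csv.reader([line], delimiter='\t', quoting=csv.QUOTE_NONE))
print(no_quotes)     # three fields: ['"aaa"', '"b\'bb', '222']
```

This shows the trade-off: honoring the quote character merges the fields, while disabling quoting preserves the field count at the cost of leaving the quote characters in the data.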

Upvotes: 2

Views: 492

Answers (1)

Gaurang Shah

Reputation: 12910

If the quotes are not closed properly, the only option is to keep them when creating the DataFrame and drop them later with custom logic.

scala> val df = spark.read.option("delimiter", "\t").option("quote", "").csv("test.csv")
scala> df.show()
+-----+-----+---+
|  _c0|  _c1|_c2|
+-----+-----+---+
|"aaa"|"b'bb|222|
+-----+-----+---+

Now, if you know which column might have the issue, just apply the following logic.

scala> df.withColumn("col_without_quotes", regexp_replace($"_c0","\"","")).show()
+-----+-----+---+------------------+
|  _c0|  _c1|_c2|col_without_quotes|
+-----+-----+---+------------------+
|"aaa"|"b'bb| 22|               aaa|
+-----+-----+---+------------------+
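The cleanup step can also be expressed outside Spark. A plain-Python sketch of the same logic as regexp_replace (the helper name strip_quotes is made up for illustration):

```python
import re

def strip_quotes(value: str) -> str:
    # Drop every double-quote character, mirroring
    # regexp_replace(col, "\"", "") in the answer above.
    return re.sub('"', '', value)

# Applied to the values from the DataFrame above:
print(strip_quotes('"aaa"'))   # aaa
print(strip_quotes('"b\'bb'))  # b'bb
```

In Spark you would wrap this in withColumn per affected column, as the answer shows; values without quotes pass through unchanged.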

Upvotes: 2
