Surender Raja

Reputation: 3599

Spark dataframe databricks csv appends extra double quotes

When I apply CONCAT to a dataframe in Spark SQL and store that dataframe as a CSV file in an HDFS location, extra double quotes are added to the concatenated column alone in the output file.

These double quotes are not added when I apply show. They are added only when I store the dataframe as a CSV file.

I need to remove the extra double quotes that get added when I save the dataframe as a CSV file.

I am using the com.databricks:spark-csv_2.10:1.1.0 jar

Spark version is 1.5.0-cdh5.5.1

Input :

 campaign_file_name_1, campaign_name_1, shagdhsjagdhjsagdhrSqpaKa5saoaus89,    1
 campaign_file_name_1, campaign_name_1, sagdhsagdhasjkjkasihdklas872hjsdjk,    2

Expected output :

 campaign_file_name_1, shagdhsjagdhjsagdhrSqpaKa5saoaus89,     campaign_name_1"="1,  2017-06-06 17:09:31
 campaign_file_name_1, sagdhsagdhasjkjkasihdklas872hjsdjk,   campaign_name_1"="2,  2017-06-06 17:09:31

Spark Code :

object campaignResultsMergerETL extends BaseETL {

  val now  = ApplicationUtil.getCurrentTimeStamp()
  val conf = new Configuration()
  val fs   = FileSystem.get(conf)
  val log  = LoggerFactory.getLogger(this.getClass.getName)

  def main(args: Array[String]): Unit = {
    // ---------------------
    // code for sqlContext initialization
    // ---------------------
    val campaignResultsDF = sqlContext.read.format("com.databricks.spark.avro").load(campaignResultsLoc)
    campaignResultsDF.registerTempTable("campaign_results")

    val campaignGroupedDF = sqlContext.sql(
      """
        |SELECT campaign_file_name,
        |campaign_name,
        |tracker_id,
        |SUM(campaign_measure) AS campaign_measure
        |FROM campaign_results
        |GROUP BY campaign_file_name, campaign_name, tracker_id
      """.stripMargin)

    campaignGroupedDF.registerTempTable("campaign_results_full")

    val campaignMergedDF = sqlContext.sql(
      s"""
        |SELECT campaign_file_name,
        |tracker_id,
        |CONCAT(campaign_name, '\"=\"', campaign_measure),
        |"$now" AS audit_timestamp
        |FROM campaign_results_full
      """.stripMargin)

    campaignMergedDF.show(20)
    saveAsCSVFiles(campaignMergedDF, campaignResultsExportLoc, numPartitions)
  }

  def saveAsCSVFiles(campaignMeasureDF: DataFrame, hdfs_output_loc: String, numPartitions: Int): Unit = {
    log.info("saveAsCSVFile method started")
    if (fs.exists(new Path(hdfs_output_loc))) {
      fs.delete(new Path(hdfs_output_loc), true)
    }
    campaignMeasureDF.repartition(numPartitions).write.format("com.databricks.spark.csv").save(hdfs_output_loc)
    log.info("saveAsCSVFile method ended")
  }

}

The result of campaignMergedDF.show(20) is correct:

 campaign_file_name_1, shagdhsjagdhjsagdhrSqpaKa5saoaus89,   campaign_name_1"="1,  2017-06-06 17:09:31
 campaign_file_name_1, sagdhsagdhasjkjkasihdklas872hjsdjk,   campaign_name_1"="2,  2017-06-06 17:09:31

The result of saveAsCSVFiles is incorrect:

 campaign_file_name_1, shagdhsjagdhjsagdhrSqpaKa5saoaus89,   "campaign_name_1""=""1",  2017-06-06 17:09:31
 campaign_file_name_1, sagdhsagdhasjkjkasihdklas872hjsdjk,   "campaign_name_1""=""2",  2017-06-06 17:09:31

Could someone help me with this issue?

Upvotes: 1

Views: 4326

Answers (1)

Yaron

Reputation: 10450

When you use

write.format("com.databricks.spark.csv").save(hdfs_output_loc)

to write text that includes " into a CSV file, you run into problems, because " is defined as the default quote character by spark-csv.

Replacing the default quote character " with something else (e.g. the NUL character, \u0000) should allow you to write " to the file as-is:

write.format("com.databricks.spark.csv").option("quote", "\u0000").save(hdfs_output_loc)

Explanation:

You are using spark-csv's defaults:

  • the escape character, which is \
  • the quote character, which is "

From the spark-csv documentation:

  • quote: by default the quote character is ", but can be set to any character. Delimiters inside quotes are ignored
  • escape: by default the escape character is \, but can be set to any character. Escaped quote characters are ignored

This answer suggested the following:

To turn off the default escaping of the double quote character (") with the backslash character (\) - i.e. to avoid escaping entirely - you must add an .option() method call with just the right parameters after the .write() method call. The goal of the option() method call is to change how the csv() method "finds" instances of the "quote" character as it is emitting the content. To do this, you must change the default of what a "quote" actually means, i.e. change the character sought from the double quote character (") to the Unicode "\u0000" character (essentially providing the Unicode NUL character, assuming it won't ever occur within the document).
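Here is a minimal, self-contained sketch of the difference (assuming local mode and hypothetical /tmp output paths; Spark 1.x API):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("quote-demo").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

// One row whose second column already contains double quotes.
val df = sqlContext.createDataFrame(Seq(
  ("campaign_file_name_1", "campaign_name_1\"=\"1")
)).toDF("campaign_file_name", "campaign_pair")

// Default quote ("): the field is wrapped in quotes and the embedded
// quotes are doubled, producing "campaign_name_1""=""1"
df.write.format("com.databricks.spark.csv").save("/tmp/with_default_quote")

// Quote set to NUL: the field is written verbatim as campaign_name_1"="1
df.write.format("com.databricks.spark.csv")
  .option("quote", "\u0000")
  .save("/tmp/without_quote")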

Upvotes: 1
