Reputation: 3599
It seems that when I apply CONCAT to a dataframe in Spark SQL and store that dataframe as a CSV file in an HDFS location, extra double quotes are added around the concatenated column alone in the output file.
The double quotes are not added when I apply show(); they are added only when I store the dataframe as a CSV file.
It seems that I need to remove the extra double quotes that get added when I save the dataframe as a CSV file.
I am using the com.databricks:spark-csv_2.10:1.1.0 jar. The Spark version is 1.5.0-cdh5.5.1.
Input:
campaign_file_name_1, campaign_name_1, shagdhsjagdhjsagdhrSqpaKa5saoaus89, 1
campaign_file_name_1, campaign_name_1, sagdhsagdhasjkjkasihdklas872hjsdjk, 2
Expected output:
campaign_file_name_1, shagdhsjagdhjsagdhrSqpaKa5saoaus89, campaign_name_1"="1, 2017-06-06 17:09:31
campaign_file_name_1, sagdhsagdhasjkjkasihdklas872hjsdjk, campaign_name_1"="2, 2017-06-06 17:09:31
Spark code:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.DataFrame
import org.slf4j.LoggerFactory

object campaignResultsMergerETL extends BaseETL {

  val now = ApplicationUtil.getCurrentTimeStamp()
  val conf = new Configuration()
  val fs = FileSystem.get(conf)
  val log = LoggerFactory.getLogger(this.getClass.getName)

  def main(args: Array[String]): Unit = {
    //---------------------
    // code for sqlContext initialization
    //---------------------
    val campaignResultsDF = sqlContext.read.format("com.databricks.spark.avro").load(campaignResultsLoc)
    campaignResultsDF.registerTempTable("campaign_results")

    val campaignGroupedDF = sqlContext.sql(
      """
        |SELECT campaign_file_name,
        |campaign_name,
        |tracker_id,
        |SUM(campaign_measure) AS campaign_measure
        |FROM campaign_results
        |GROUP BY campaign_file_name, campaign_name, tracker_id
      """.stripMargin)
    campaignGroupedDF.registerTempTable("campaign_results_full")

    val campaignMergedDF = sqlContext.sql(
      s"""
        |SELECT campaign_file_name,
        |tracker_id,
        |CONCAT(campaign_name, '\"=\"', campaign_measure),
        |"$now" AS audit_timestamp
        |FROM campaign_results_full
      """.stripMargin)

    campaignMergedDF.show(20)
    saveAsCSVFiles(campaignMergedDF, campaignResultsExportLoc, numPartitions)
  }

  def saveAsCSVFiles(campaignMeasureDF: DataFrame, hdfs_output_loc: String, numPartitions: Int): Unit = {
    log.info("saveAsCSVFile method started")
    // Overwrite any previous output at the target location.
    if (fs.exists(new Path(hdfs_output_loc))) {
      fs.delete(new Path(hdfs_output_loc), true)
    }
    campaignMeasureDF.repartition(numPartitions).write.format("com.databricks.spark.csv").save(hdfs_output_loc)
    log.info("saveAsCSVFile method ended")
  }
}
The result of campaignMergedDF.show(20) is correct and works fine:
campaign_file_name_1, shagdhsjagdhjsagdhrSqpaKa5saoaus89, campaign_name_1"="1, 2017-06-06 17:09:31
campaign_file_name_1, sagdhsagdhasjkjkasihdklas872hjsdjk, campaign_name_1"="2, 2017-06-06 17:09:31
The result of saveAsCSVFiles is incorrect:
campaign_file_name_1, shagdhsjagdhjsagdhrSqpaKa5saoaus89, "campaign_name_1""=""1", 2017-06-06 17:09:31
campaign_file_name_1, sagdhsagdhasjkjkasihdklas872hjsdjk, "campaign_name_1""=""2", 2017-06-06 17:09:31
Could someone help me with this issue?
Upvotes: 1
Views: 4326
Reputation: 10450
When you use
write.format("com.databricks.spark.csv").save(hdfs_output_loc)
to write text that includes " into a CSV file, you run into problems, because the " sign is defined as the default quote character by spark-csv.
Replacing the default quote character with something else (e.g. the NUL character) should allow you to write " to the file as-is:
write.format("com.databricks.spark.csv").option("quote", "\u0000").save(hdfs_output_loc)
Explanation:
You are using the default spark-csv settings:
escape character: \
quote character: "
This answer suggested the following:
The way to turn off the default escaping of the double quote character (") with the backslash character (\) - i.e. to avoid escaping for all characters entirely - is to add an .option() method call with just the right parameters after the .write() method call. The goal of the option() method call is to change how the csv() method "finds" instances of the "quote" character as it is emitting the content. To do this, you must change the default of what a "quote" actually means; i.e. change the character sought from being a double quote character (") to a Unicode "\u0000" character (essentially providing the Unicode NUL character, assuming it won't ever occur within the document).
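To illustrate the difference, here is a minimal sketch; the sample value and the /tmp output paths are made up for the demo:
// A one-column DataFrame whose value contains embedded double quotes,
// mirroring the CONCAT output from the question.
val df = sqlContext.createDataFrame(Seq(Tuple1("campaign_name_1\"=\"1"))).toDF("col")

// Default behaviour: spark-csv wraps the field in quotes and doubles the
// embedded ones, producing: "campaign_name_1""=""1"
df.write.format("com.databricks.spark.csv").save("/tmp/quoted_out")

// With the quote character set to NUL, the value is written verbatim:
// campaign_name_1"="1
df.write.format("com.databricks.spark.csv")
  .option("quote", "\u0000")
  .save("/tmp/unquoted_out")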
Upvotes: 1