Shivam

Reputation: 243

How to remove double quotes from column name while saving dataframe in csv in spark?

I am saving a Spark dataframe to a CSV file. All the records are saved in double quotes, which is fine, but the column names also come wrapped in double quotes. Could you please help me remove them?

Example:

"Source_System"|"Date"|"Market_Volume"|"Volume_Units"|"Market_Value"|"Value_Currency"|"Sales_Channel"|"Competitor_Name"
"IMS"|"20080628"|"183.0"|"16470.0"|"165653.256349"|"AUD"|"AUSTRALIA HOSPITAL"|"PFIZER"

Desirable Output:

Source_System|Date|Market_Volume|Volume_Units|Market_Value|Value_Currency|Sales_Channel|Competitor_Name
"IMS"|"20080628"|"183.0"|"16470.0"|"165653.256349"|"AUD"|"AUSTRALIA HOSPITAL"|"PFIZER"

I am using below code:

df4.repartition(1).write.csv(Output_Path_ASPAC, quote='"', header=True, quoteAll=True, sep='|', mode='overwrite')

Upvotes: 2

Views: 3660

Answers (1)

notNull

Reputation: 31470

I think the only workaround would be to concatenate the quotes onto the column values in the dataframe before writing to CSV.

Example:

df.show()
#+---+----+------+
#| id|name|salary|
#+---+----+------+
#|  1|   a|   100|
#+---+----+------+

from pyspark.sql.functions import col, concat, lit

cols = [concat(lit('"'), col(i), lit('"')).alias(i) for i in df.columns]
df1 = df.select(*cols)

df1.show()
#+---+----+------+
#| id|name|salary|
#+---+----+------+
#|"1"| "a"| "100"|
#+---+----+------+

df1.\
    write.\
    csv("<path>", header=True, sep='|', escape='', quote='', mode='overwrite')

#output
#cat tmp4/part*
#id|name|salary
#"1"|"a"|"100"

Upvotes: 1
