Reputation: 243
I am saving spark dataframe into csv file. All the records is saving in double quotes that is fine but column name also coming in double quotes. Could you please help me how to remove them?
Example:
"Source_System"|"Date"|"Market_Volume"|"Volume_Units"|"Market_Value"|"Value_Currency"|"Sales_Channel"|"Competitor_Name"
"IMS"|"20080628"|"183.0"|"16470.0"|"165653.256349"|"AUD"|"AUSTRALIA HOSPITAL"|"PFIZER"
Desirable Output:
Source_System|Date|Market_Volume|Volume_Units|Market_Value|Value_Currency|Sales_Channel|Competitor_Name
"IMS"|"20080628"|"183.0"|"16470.0"|"165653.256349"|"AUD"|"AUSTRALIA HOSPITAL"|"PFIZER"
I am using below code:
df4.repartition(1).write.csv(Output_Path_ASPAC, quote='"', header=True, quoteAll=True, sep='|', mode='overwrite')
Upvotes: 2
Views: 3660
Reputation: 31470
I think only workaround would be concat quotes to the column values
in dataframe before writing to csv
.
Example:
df.show()
#+---+----+------+
#| id|name|salary|
#+---+----+------+
#| 1| a| 100|
#+---+----+------+
from pyspark.sql.functions import col, concat, lit
cols = [concat(lit('"'), col(i), lit('"')).alias(i) for i in df.columns]
df1=df.select(*cols)
df1.show()
#+---+----+------+
#| id|name|salary|
#+---+----+------+
#|"1"| "a"| "100"|
#+---+----+------+
df1.\
write.\
csv("<path>", header=True, sep='|',escape='', quote='',mode='overwrite')
#output
#cat tmp4/part*
#id|name|salary
#"1"|"a"|"100"
Upvotes: 1