The Wanderer

Reputation: 3271

PySpark escapeQuotes=False still escapes quotes

Problem: When writing a DataFrame as CSV, I do not want quotes to be escaped. However, setting escapeQuotes=False does not seem to have any effect.

Here is an example case:

Data preparation:

from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql import SparkSession, functions as func

spark = SparkSession.builder.appName("test").getOrCreate()

data = [
    ("James", "Smith"),
    ("Michael", "Rose"),
]

schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("lastname", StringType(), True),
])
 
df = spark.createDataFrame(data=data,schema=schema)
df.show(truncate=False)

Output:

+---------+--------+
|firstname|lastname|
+---------+--------+
|James    |Smith   |
|Michael  |Rose    |
+---------+--------+

Adding a column that contains a newline character:

def create_column_with_newline(elem):
    return f'"{elem["firstname"]}\n{elem["lastname"]}"'


columnWithNewlineUDF = func.udf(create_column_with_newline)

df = df.withColumn('newline_col', columnWithNewlineUDF(func.struct('firstname', 'lastname')))
df.show()

Output:

+---------+--------+-----------------+
|firstname|lastname|      newline_col|
+---------+--------+-----------------+
|    James|   Smith|    "James
Smith"|
|  Michael|    Rose|   "Michael
Rose"|
+---------+--------+-----------------+

Writing the CSV with escapeQuotes=False:

df.coalesce(1).write.csv('test.tsv', mode='overwrite', sep='\t', header=True, encoding='UTF-8', escapeQuotes=False)

Output:

firstname   lastname    newline_col
James   Smith   "\"James
Smith\""
Michael Rose    "\"Michael
Rose\""

As you can see, the newline_col is written with escaped quotes :-(

Expected Output:

firstname   lastname    newline_col
James   Smith   "James
Smith"
Michael Rose    "Michael
Rose"

Upvotes: 2

Views: 1435

Answers (2)

Red

Reputation: 137

Setting these two options worked for me:

.option("quote", "")
.option("escapeQuotes", False)
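
For reference, here is a minimal sketch of how those two options might be passed from PySpark, where the boolean is spelled False. The CSV_OPTIONS name and the write_unquoted_tsv helper are made up for illustration; the sep, header, encoding, and coalesce settings are copied from the question. How an empty quote setting is handled on write may depend on the Spark version, so it is worth inspecting the output file.

```python
# Hypothetical helper; option names are taken from the question and this answer.
CSV_OPTIONS = {
    "sep": "\t",
    "header": True,
    "encoding": "UTF-8",
    "quote": "",            # empty string instead of the default '"' quote character
    "escapeQuotes": False,  # Python boolean, not lowercase `false`
}


def write_unquoted_tsv(df, path):
    """Write `df` as a single tab-separated part file using CSV_OPTIONS."""
    writer = df.coalesce(1).write.mode("overwrite")
    writer.options(**CSV_OPTIONS).csv(path)
```

With the DataFrame from the question this would be called as write_unquoted_tsv(df, "test.tsv").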

Upvotes: 1

Kafels

Reputation: 4069

Just remove the quotes from the UDF. The CSV writer already encloses values that contain a newline in quotes:

def create_column_with_newline(elem):
    #      f'"{elem["firstname"]}\n{elem["lastname"]}"'
    return f'{elem["firstname"]}\n{elem["lastname"]}'

Output:

firstname   lastname    newline_col
James   Smith   "James
Smith"
Michael Rose    "Michael
Rose"
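
The effect can be imitated without Spark. The sketch below uses Python's standard csv module as a stand-in (it is not Spark's writer, and the to_tsv helper is a name invented here), configured to quote only fields that need it and to backslash-escape embedded quotes. A value with a newline gets enclosed in quotes either way; literal quotes inside the value are what end up escaped.

```python
import csv
import io


def to_tsv(rows):
    """Serialize rows as TSV, backslash-escaping embedded quote characters."""
    buf = io.StringIO()
    writer = csv.writer(
        buf,
        delimiter="\t",
        lineterminator="\n",
        quoting=csv.QUOTE_MINIMAL,  # quote only fields that need it (e.g. newlines)
        doublequote=False,          # escape embedded quotes instead of doubling them
        escapechar="\\",
    )
    writer.writerows(rows)
    return buf.getvalue()


# Value built with literal quotes, as in the question's UDF.
with_quotes = to_tsv([["James", "Smith", '"James\nSmith"']])
# Value built without literal quotes, as in this answer's UDF.
without_quotes = to_tsv([["James", "Smith", "James\nSmith"]])

print(repr(with_quotes))
print(repr(without_quotes))
```

In this stand-in, the first string carries \" around the value, like the output in the question, while the second has only the enclosing quotes added by the writer.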


Upvotes: 1
