kavya

Reputation: 789

How to append to a CSV file using df.write.csv in PySpark?

I'm trying to append data to my CSV file using df.write.csv. This is what I did after following the Spark documentation at http://spark.apache.org/docs/2.0.1/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter:

from pyspark.sql import DataFrameWriter
.....
df1 = sqlContext.createDataFrame(query1)
df1.write.csv("/opt/Output/sqlcsvA.csv", append) #also tried 'mode=append'

Executing the above code gives me error:

NameError: name 'append' is not defined

Without append, the error is:

The path already exists.

Upvotes: 12

Views: 55227

Answers (3)

Davos

Reputation: 5435

From the docs (https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter), available since v1.4:

csv(path, mode=None, compression=None, sep=None, quote=None, escape=None, header=None, nullValue=None, escapeQuotes=None, quoteAll=None, dateFormat=None, timestampFormat=None)

e.g.

from pyspark.sql import DataFrameWriter
.....
df1 = sqlContext.createDataFrame(query1)
df1.write.csv(path="/opt/Output/sqlcsvA.csv", mode="append")

If you want to write a single file, you can use coalesce or repartition on either of those lines. It doesn't matter which line, because the DataFrame is just a DAG of transformations; no execution happens until the write to CSV. repartition and coalesce effectively share the same code path, but coalesce can only reduce the number of partitions, whereas repartition can also increase them. I'd just stick to repartition for simplicity.

e.g.

df1 = sqlContext.createDataFrame(query1).repartition(1)

or

df1.repartition(1).write.csv(path="/opt/Output/sqlcsvA.csv", mode="append")

I think the examples in the docs aren't great; they don't show how to use any parameters other than the path.

Referring to the two things you tried:

(append)

For that to work, there would need to be a string variable named append containing the value "append". There's no string constant called append in the DataFrameWriter library. In other words, if you added append = "append" earlier in your code, the call would then work.
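
For illustration, a minimal sketch of that workaround (the call mirrors the one in the question; the variable name append is arbitrary):

# Hypothetical workaround: bind a variable to the mode string.
append = "append"
# The second positional parameter of csv() is mode, so this is
# equivalent to mode="append".
df1.write.csv("/opt/Output/sqlcsvA.csv", append)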

('mode=append')

For that to work, the csv method would have to parse the 'mode=append' string to extract the mode value, which would be extra work when it can instead take a parameter whose value is exactly "append" or "overwrite". (None, the default for mode, is the Python built-in, not anything specific to PySpark.)

On another note, I recommend using named parameters where possible. e.g.

csv(path="/path/to/file.csv", mode="append")

instead of positional parameters

csv("/path/to/file.csv", "append")

It's clearer and helps comprehension.

Upvotes: 3

Zhang Tong

Reputation: 4719

df.write.save(path='csv', format='csv', mode='append', sep='\t')
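
For context, a minimal end-to-end sketch around that call (the SparkSession setup and sample data are assumptions for illustration, not part of the original answer):

from pyspark.sql import SparkSession

# Hypothetical session and sample data, just to make the call runnable.
spark = SparkSession.builder.appName("append-example").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# path is the output directory; Spark writes part files inside it,
# and mode='append' adds new part files on each run.
df.write.save(path="csv", format="csv", mode="append", sep="\t")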

Upvotes: 16

Anton Okolnychyi

Reputation: 976

I don't know about Python, but in Scala and Java one can set the save mode in the following way:

df.write.mode("append").csv("pathToFile")

I assume that it should be similar in Python. This may be helpful.
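
For what it's worth, the same chained form does work in PySpark, since its DataFrameWriter also has a mode method:

# PySpark equivalent of the Scala/Java snippet above.
df.write.mode("append").csv("pathToFile")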

Upvotes: 1
