Quetzalcoatl

Reputation: 2146

DataFrameWriter not callable

Using a very simple-minded approach to read data, select a subset of it, and write it out, I'm getting a TypeError saying that the 'DataFrameWriter' object is not callable.

I'm surely missing something basic.

Using an AWS EMR cluster:

$ pyspark
> dx = spark.read.parquet("s3://my_folder/my_date*/*.gz.parquet")    
> dx_sold = dx.filter("keywords like '%sold%'")    
# select agent ids
> dc = dx_sold.select("agent_id")

Question: The goal is now to save the values of dc, e.g. to S3 as a line-separated text file.

What is the best practice for doing so?

Attempts

I tried

dc.write("s3://my_folder/results/") 

but received

TypeError: 'DataFrameWriter' object is not callable
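
This seems to be because write is a property that returns a DataFrameWriter rather than being callable itself, so the save has to go through one of the writer's methods, along these lines (output path and format illustrative):

# write returns a DataFrameWriter; call one of its save methods instead of calling write itself
dc.write.csv("s3://my_folder/results/")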

Also tried

X = dc.collect()

but eventually received a TimeOut error message.

Also tried

dc.write.format("csv").options(delimiter=",").save("s3://my_folder/results/")

But eventually received messages of the form

TaskSetManager: Lost task 4323.0 in stage 9.0 (TID 88327, ip-<hidden>.internal, executor 96): TaskKilled (killed intentionally)

Upvotes: 1

Views: 2571

Answers (1)

Quetzalcoatl

Reputation: 2146

The first comment is correct: it was a filesystem problem. The ad-hoc solution was to convert the desired results to a list and then serialize the list, e.g.:

import pickle

dc = dx_sold.select("agent_id").distinct()
# collect() returns Row objects, so extract the agent_id field from each
result_list = [str(row.agent_id) for row in dc.collect()]
with open(result_path, "wb") as f:
    pickle.dump(result_list, f)
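
For completeness: once the underlying filesystem issue is sorted out, the distributed write attempted in the question should also produce line-separated output directly on S3. A rough sketch (output path illustrative):

# cast to string because the text sink expects a single string column
dc.selectExpr("cast(agent_id as string) as agent_id") \
  .write.mode("overwrite").text("s3://my_folder/results/")

Each partition is written as its own part file under the output prefix.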

Upvotes: 1
