Reputation: 2146
Using a very simple-minded approach to read data, select a subset of it, and write it out, I'm getting TypeError: 'DataFrameWriter' object is not callable.
I'm surely missing something basic.
Using an AWS EMR:
$ pyspark
> dx = spark.read.parquet("s3://my_folder/my_date*/*.gz.parquet")
> dx_sold = dx.filter("keywords like '%sold%'")
# select agent ids
> dc = dx_sold.select("agent_id")
Question
The goal is to now save the values of dc, e.g. to S3 as a line-separated text file.
What's a best practice for doing so?
Attempts
I tried
dc.write("s3://my_folder/results/")
but received
TypeError: 'DataFrameWriter' object is not callable
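(As it turns out, DataFrame.write is a property that returns a DataFrameWriter, not a method, so it can't be called with a path directly. A minimal sketch of the intended call, reusing the output path from above:)
# write is a property; call a format method on the returned DataFrameWriter
dc.write.parquet("s3://my_folder/results/")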
Also tried
X = dc.collect()
but eventually received a TimeOut error message.
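(collect() pulls the entire result set to the driver, which is what times out on a large DataFrame; a minimal sketch of bounding what comes back, using standard PySpark calls:)
# bring only a bounded number of rows to the driver for inspection
preview = dc.limit(1000).collect()  # equivalently: dc.take(1000)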
Also tried
dc.write.format("csv").options(delimiter=",").save("s3://my_folder/results/")
But eventually received messages of the form
TaskSetManager: Lost task 4323.0 in stage 9.0 (TID 88327, ip-<hidden>.internal, executor 96): TaskKilled (killed intentionally)
Upvotes: 1
Views: 2571
Reputation: 2146
The first comment is correct: it was an FS problem. The ad-hoc solution was to convert the desired results to a list and then serialize the list, e.g.:
import pickle

dc = dx_sold.select("agent_id").distinct()
# collect() returns Row objects, so pull the column value out of each
result_list = [str(row.agent_id) for row in dc.collect()]
pickle.dump(result_list, open(result_path, "wb"))
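(For the line-separated text file the question asked about, the distributed write also works once the FS issue is resolved; a sketch assuming agent_id needs casting, since the text writer accepts exactly one string column:)
# write one agent_id per line directly from the executors;
# text() requires a single string column, hence the cast
dc.selectExpr("CAST(agent_id AS STRING)").write.mode("overwrite").text("s3://my_folder/results/")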
Upvotes: 1