fatdragon

Reputation: 2309

SparkContext can only be used on the driver

I am trying to use the SparkContext.binaryFiles function to process a set of ZIP files. The setup is to map over an RDD of filenames, and the mapping function calls binaryFiles for each one.

The problem is that SparkContext is referenced inside the mapping function, and I'm getting the error below. How can I fix it?

PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

Sample code:

file_list_rdd.map(lambda x: sc.binaryFiles("/FileStore/tables/xyz/" + x[1]))

where file_list_rdd is an RDD of (id, filename) tuples.

Upvotes: 4

Views: 13836

Answers (1)

Ged

Reputation: 18098

You need to call binaryFiles without referencing the SparkContext inside a transformation; the SparkContext exists only on the driver and cannot be serialized to the workers.

Also consider moving the function / def into the body of the map itself; that is commonly done in functional code. In my experience, serialization errors like this one are often only resolved by moving such defs into the executor-side logic, as sketched below.
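
For example, a minimal sketch (process_record and the work it does are hypothetical, just to show the shape): the function is defined locally, captures no SparkContext, and therefore pickles cleanly to the executors.

    def process_record(record):
        # Runs on the executors; no SparkContext is captured here.
        file_id, filename = record
        return (file_id, filename.lower())  # placeholder per-record work

    result_rdd = file_list_rdd.map(process_record)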

Some file processing is also done on the driver. This post could be of interest: How to paralelize spark etl more w/out losing info (in file names). Based on your code snippet, that is the approach I would look at here.

Instead, you should use something like this, called from the driver, and process the results accordingly:

 zip_data = sc.binaryFiles('/user/path-to-folder-with-zips/*.zip')

Now you are calling binaryFiles from the driver, where the sc is valid.
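
As an illustration, here is a minimal sketch of reading the archives on the driver and unpacking them on the executors (the path and the unzip helper are assumptions, not from the question):

    import io
    import zipfile

    # Driver-side call: sc is valid here. Each element is a
    # (path, file_contents_as_bytes) pair.
    zip_data = sc.binaryFiles('/user/path-to-folder-with-zips/*.zip')

    def unzip(pair):
        # Executor-side work: no SparkContext is referenced.
        path, content = pair
        with zipfile.ZipFile(io.BytesIO(content)) as zf:
            return [(path, name, zf.read(name)) for name in zf.namelist()]

    extracted = zip_data.flatMap(unzip)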

Upvotes: 2
