Reputation: 137
I have a number of small files that I want to load into an RDD and then map over, executing an algorithm on each file in parallel. The algorithm needs to fetch data from HDFS/Hive tables. When I use SparkSQL to fetch that data, I get the error below:
pickle.PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
SparkSQL uses SQLContext, which is a wrapper around SparkContext. Does this mean I cannot use SparkSQL inside code that executes on workers? That would be very limiting.
Can someone please share some knowledge on how to code my logic in PySpark?
Here is a sample PySpark code that I am using:
import sys

# sc is assumed to be an already-created SparkContext

def apply_algorithm(filename):
    # SparkSQL logic goes here
    # some more logic
    return someResult

def main(argv):
    print "Entered main method"
    input_dir = sys.argv[1]
    output_dir = sys.argv[2]

    fileNameContentMapRDD = sc.wholeTextFiles(input_dir)
    print "fileNameContentMapRDD = ", fileNameContentMapRDD.collect()

    resultRDD = fileNameContentMapRDD.map(lambda x: apply_algorithm(x[0]))
    print resultRDD.collect()
    print "end of main."
Upvotes: 1
Views: 208
Reputation: 21
Does this mean I cannot use SparkSQL inside a code that executes on workers?
Yes, it means exactly that. You can use neither RDDs nor DataFrames from inside code that runs on the workers (i.e., inside transformations or actions); SparkContext, SQLContext, and SparkSession exist only on the driver.
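One way to restructure the logic is to do all of the SparkSQL/Hive access on the driver and combine the results with the file data through DataFrame operations, keeping only plain Python inside the functions that run on the workers. Below is a minimal sketch of that idea; the names input_dir, my_hive_table, and extra_data are placeholders, not anything from your actual job:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

input_dir = "hdfs:///path/to/small/files"  # placeholder path

# Read the small files on the driver; each element is (filename, content).
files_df = spark.sparkContext.wholeTextFiles(input_dir).toDF(["filename", "content"])

# Do all SparkSQL/Hive access on the driver side ('my_hive_table' and
# 'extra_data' are placeholder names)...
lookup_df = spark.sql("SELECT filename, extra_data FROM my_hive_table")

# ...and combine it with the file data using a DataFrame join instead of
# calling SparkSQL inside map().
joined_df = files_df.join(lookup_df, on="filename", how="left")

# The pure-Python part of the algorithm (no SparkContext/SQLContext use)
# can still run in parallel on the workers.
def apply_algorithm(row):
    # Only plain Python here; row.content and row.extra_data are available.
    return len(row.content)

result = joined_df.rdd.map(apply_algorithm).collect()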
Upvotes: 2