user8708009

Reputation: 137

Using SparkSQL inside PySpark map logic not working

I have a number of small files. I want to load them into an RDD and then map over them to run an algorithm on each file in parallel. The algorithm needs to fetch supporting data from HDFS/Hive tables. When I use SparkSQL inside the mapped function to fetch that data, I get the error below:

pickle.PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

SparkSQL uses SQLContext, which is a wrapper around SparkContext. Does this mean I cannot use SparkSQL inside code that executes on the workers? If so, that would be very limiting.

Can someone please share some knowledge on how to code my logic in PySpark?

Here is a sample of the PySpark code I am using:

import sys

from pyspark import SparkContext

# SparkContext is created on the driver.
sc = SparkContext(appName="apply_algorithm")


def apply_algorithm(filename):
    # SparkSQL logic goes here
    # some more logic
    return someResult


def main(argv):
    print("Entered main method")
    input_dir = sys.argv[1]
    output_dir = sys.argv[2]

    fileNameContentMapRDD = sc.wholeTextFiles(input_dir)
    print("fileNameContentMapRDD = ", fileNameContentMapRDD.collect())

    # apply_algorithm runs on the executors; calling SparkSQL inside it
    # raises the PicklingError shown above.
    resultRDD = fileNameContentMapRDD.map(lambda x: apply_algorithm(x[0]))

    print(resultRDD.collect())
    print("end of main.")


if __name__ == "__main__":
    main(sys.argv)

Upvotes: 1

Views: 208

Answers (1)

user8708136

Reputation: 21

Does this mean I cannot use SparkSQL inside code that executes on the workers?

Yes, it means exactly that. You can use neither RDDs nor DataFrames from within a parallelized context: the SparkContext (and the SQLContext/SparkSession built on top of it) exists only on the driver, while the functions you pass to map run on the executors, where it is not available.
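
If the data the algorithm needs can be fetched up front, one common restructuring is to run all SparkSQL on the driver and ship only plain Python data to the executors, for example through a broadcast variable. Below is a minimal sketch of that idea; the table name ref_table, its columns, and the body of apply_algorithm are hypothetical placeholders, not part of the original question:

from pyspark.sql import SparkSession

# Driver-side setup: the SparkSession (and its SparkContext) live only on the driver.
spark = SparkSession.builder.appName("driver_side_sql").getOrCreate()
sc = spark.sparkContext

# 1) Do the SparkSQL work on the driver and collect the result ...
ref_rows = spark.sql("SELECT key, value FROM ref_table").collect()

# 2) ... then broadcast it to the executors as an ordinary Python dict.
ref_map = sc.broadcast({row["key"]: row["value"] for row in ref_rows})


def apply_algorithm(filename, content):
    # Pure Python on the executors: look up the broadcast dict
    # instead of calling SparkSQL here.
    return filename, len(content), ref_map.value.get(filename)


files = sc.wholeTextFiles("hdfs:///path/to/input")
result = files.map(lambda kv: apply_algorithm(kv[0], kv[1]))
print(result.collect())

If the reference data is too large to collect and broadcast, the usual alternative is to express the whole computation as DataFrame operations (e.g. read the files into a DataFrame and join it with the Hive table), so that Spark distributes the work itself instead of SparkSQL being called from inside map.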

Upvotes: 2
