Konserwa

Reputation: 11

TextBlob module not found in PySpark Dataproc cluster

I'm using Dataproc for Spark computing. The problem is that my worker nodes don't have access to the textblob package. How can I fix it? I'm coding in a Jupyter notebook with a PySpark kernel.

The error:

PythonException: 
  An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 588, in main
    func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 447, in read_udfs
    udfs.append(read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=i))
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 249, in read_single_udf
    f, return_type = read_command(pickleSer, infile)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 69, in read_command
    command = serializer._read_with_length(file)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 160, in _read_with_length
    return self.loads(obj)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 430, in loads
    return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'textblob'

Example code that fails:

data = [{"Category": 'Aaaa'},
        {"Category": 'Bbbb'},
        {"Category": 'Cccc'},
        {"Category": 'Eeeee'}
        ]
df = spark.createDataFrame(data)

def sentPackage(text):
    import textblob
    return TextBlob(text).sentiment.polarity


sentPackageUDF = udf(sentPackage, StringType(), )
df = df.withColumn("polarity", sentPackageUDF(f.col("Category")))
df.show()

Upvotes: 1

Views: 214

Answers (2)

wc203

Reputation: 1217

You have to add your libraries to the SparkContext, which makes them available to all worker nodes.

Create a folder (any name) to hold the libraries you want the worker nodes to access

mkdir dependencies

Navigate to that folder and install the libraries into it using pip

pip install textblob -t .
pip .. .. ..

Zip the dependencies

zip -r dependencies.zip .

Add the zip to the SparkContext

spark = SparkSession \
    .builder \
    ...
    .getOrCreate()

spark.sparkContext.addPyFile("dependencies.zip")
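
With the zip on the workers' Python path, imports inside the UDF resolve against it. A minimal sketch of the worker-side usage, assuming dependencies.zip contains the unpacked packages:

def sentPackage(text):
    from textblob import TextBlob  # resolved from dependencies.zip on each worker
    return TextBlob(text).sentiment.polarity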

Upvotes: 0

rikyeah

Reputation: 2013

The key is to define a function that will be sent to the workers and import textblob inside it:

def function_to_be_executed_by_workers(...):
    import textblob
    # use textblob and perform operations on data
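
For example, applied to the question's UDF (a sketch, assuming textblob is actually importable on the worker nodes, e.g. installed on each node or shipped as a zip as in the other answer):

from pyspark.sql import functions as f
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

def sentPackage(text):
    from textblob import TextBlob  # imported on the worker, not the driver
    return TextBlob(text).sentiment.polarity  # polarity is a float

sentPackageUDF = udf(sentPackage, DoubleType())
df = df.withColumn("polarity", sentPackageUDF(f.col("Category")))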

Upvotes: 0
