Reputation: 11
I'm using Dataproc for Spark computing. The problem is that my worker nodes don't have access to the textblob package. How can I fix it? I'm coding in a Jupyter notebook with the PySpark kernel.
Code error:
PythonException:
An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 588, in main
    func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 447, in read_udfs
    udfs.append(read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=i))
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 249, in read_single_udf
    f, return_type = read_command(pickleSer, infile)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 69, in read_command
    command = serializer._read_with_length(file)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 160, in _read_with_length
    return self.loads(obj)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 430, in loads
    return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'textblob'
Example code that fails:
data = [{"Category": 'Aaaa'},
{"Category": 'Bbbb'},
{"Category": 'Cccc'},
{"Category": 'Eeeee'}
]
df = spark.createDataFrame(data)
def sentPackage(text):
import textblob
return TextBlob(text).sentiment.polarity
sentPackageUDF = udf(sentPackage, StringType(), )
df = df.withColumn("polarity", sentPackageUDF(f.col("Category")))
df.show()
Upvotes: 1
Views: 214
Reputation: 1217
You have to add your libraries to the SparkContext, which makes them available to all worker nodes.
Create a folder (any name works) to hold the libraries you want the worker nodes to access:
mkdir dependencies
Navigate to that folder and download the libraries into it using pip:
pip download textblob -d .
pip .. .. ..
Zip the dependencies:
zip -r dependencies.zip .
Then add the zipped dependencies to the SparkContext:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    ...
    .getOrCreate()

spark.sparkContext.addPyFile("dependencies.zip")
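Once dependencies.zip has been shipped with addPyFile, a UDF that imports textblob inside the function body should be able to resolve it on the workers. A minimal sketch reusing the question's DataFrame (assuming the zip unpacks to an importable textblob package at its top level):

import pyspark.sql.functions as f
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

def sentPackage(text):
    # imported on the worker, resolved from the dependencies.zip shipped via addPyFile
    from textblob import TextBlob
    return TextBlob(text).sentiment.polarity

# polarity is a float, so DoubleType is used here rather than the question's StringType
sentPackageUDF = udf(sentPackage, DoubleType())
df.withColumn("polarity", sentPackageUDF(f.col("Category"))).show()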
Upvotes: 0
Reputation: 2013
The key is to define a function that will be sent to the workers and import textblob inside it:
def function_to_be_executed_by_workers(...):
    import textblob
    # use textblob and perform operations on data
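Applied to the question's UDF, that means resolving every textblob name inside the function instead of relying on a name imported on the driver. A minimal sketch (it still assumes textblob is installed on, or shipped to, the worker nodes):

import pyspark.sql.functions as f
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

def sentPackage(text):
    # the import runs on the worker, so nothing from the driver's textblob import gets pickled
    import textblob
    return textblob.TextBlob(text).sentiment.polarity

sentPackageUDF = udf(sentPackage, DoubleType())  # polarity is a float
df = df.withColumn("polarity", sentPackageUDF(f.col("Category")))
df.show()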
Upvotes: 0