What is the correct usage of the method Parallelize in the Spark module pyspark.mllib.classification

Question

I am running a Databricks Community Edition Spark cluster from the notebook UI.
I get this error when trying to train a NaiveBayes model on a tiny data sample: TypeError: unbound method parallelize() must be called with SparkContext instance as first argument (got list instance instead)

Code :

from pyspark.mllib.classification import LabeledPoint, NaiveBayes
from pyspark import SparkContext as sc
data = [
LabeledPoint(0.0, [0.0, 0.0]),
LabeledPoint(0.0, [0.0, 1.0]),
LabeledPoint(1.0, [1.0, 0.0])]
model = NaiveBayes.train(sc.parallelize(data))
model.predict(array([0.0, 1.0]))
model.predict(array([1.0, 0.0]))
model.predict(sc.parallelize([[1.0, 0.0]])).collect()

Josh Rosen · Accepted Answer

The problem here is the import on line two of your example:

from pyspark import SparkContext as sc

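This rebinds the name sc, which normally holds the built-in SparkContext instance, to the SparkContext class itself. Because parallelize() is an instance method, calling it on the class passes your list in the position where a SparkContext instance is expected, which is exactly what the TypeError reports.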
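To see the mechanism without a cluster, here is a minimal sketch in plain Python. FakeContext is a hypothetical stand-in I made up for illustration; it is not part of pyspark and only mimics the shape of an instance method like parallelize(). The exact wording of the TypeError differs between Python versions, so the sketch only checks that one is raised.

```python
class FakeContext(object):
    """Hypothetical stand-in for SparkContext (not a pyspark class)."""

    def parallelize(self, items):
        # Instance method: expects to be called on an instance.
        return list(items)


# Equivalent of `from pyspark import SparkContext as sc`:
# the name sc now refers to the class, not to an instance.
sc = FakeContext
try:
    sc.parallelize([1, 2, 3])
    failed = False
except TypeError:
    # The list was passed where the instance should have been.
    failed = True

# With sc bound to an instance, the same call works.
sc = FakeContext()
result = sc.parallelize([1, 2, 3])
```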

In Databricks, you don't need to create the SparkContext yourself; it's automatically pre-defined as sc in Databricks notebooks. See https://docs.databricks.com/user-guide/getting-started.html#predefined-variables for a more complete list of pre-defined variables in Databricks.
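A corrected version of the question's code could look roughly like the sketch below. The wrapper function train_and_predict is a hypothetical name introduced here only so the example takes the SparkContext instance as an argument instead of importing anything named sc. It also imports array from numpy, which the original snippet used without importing, and takes LabeledPoint from pyspark.mllib.regression, where it is defined.

```python
def train_and_predict(sc):
    """Train NaiveBayes on the question's three points.

    sc must be a SparkContext *instance*, such as the one a
    Databricks notebook predefines. (Hypothetical helper name.)
    """
    # Imported inside the function so that merely defining it
    # does not require Spark; an illustrative choice, not a rule.
    from numpy import array
    from pyspark.mllib.classification import NaiveBayes
    from pyspark.mllib.regression import LabeledPoint

    data = [
        LabeledPoint(0.0, [0.0, 0.0]),
        LabeledPoint(0.0, [0.0, 1.0]),
        LabeledPoint(1.0, [1.0, 0.0]),
    ]
    # parallelize() is called on the instance, so it receives the list
    # as its data argument rather than in place of the instance.
    model = NaiveBayes.train(sc.parallelize(data))

    single_a = model.predict(array([0.0, 1.0]))
    single_b = model.predict(array([1.0, 0.0]))
    batch = model.predict(sc.parallelize([[1.0, 0.0]])).collect()
    return single_a, single_b, batch
```

In a Databricks notebook you would call train_and_predict(sc) with the predefined sc, and leave out the `from pyspark import SparkContext as sc` line entirely.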
