Avinash Peyyety

Reputation: 25

What is the correct way to call parallelize when training a pyspark.mllib.classification model?

  1. I am running a Spark cluster on Databricks Community Edition from the notebook UI.
  2. Training a NaiveBayes model on a tiny data sample fails with this error: TypeError: unbound method parallelize() must be called with SparkContext instance as first argument (got list instance instead)
  3. Code:

    from pyspark.mllib.classification import LabeledPoint, NaiveBayes
    from pyspark import SparkContext as sc
    from numpy import array
    data = [
    LabeledPoint(0.0, [0.0, 0.0]),
    LabeledPoint(0.0, [0.0, 1.0]),
    LabeledPoint(1.0, [1.0, 0.0])]
    model = NaiveBayes.train(sc.parallelize(data))
    model.predict(array([0.0, 1.0]))
    model.predict(array([1.0, 0.0]))
    model.predict(sc.parallelize([[1.0, 0.0]])).collect()
    

Upvotes: 0

Views: 496

Answers (1)

Josh Rosen

Reputation: 13831

The problem here is the import on line two of your example:

    from pyspark import SparkContext as sc

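This rebinds the name sc, which in a Databricks notebook already refers to the built-in SparkContext instance, to the SparkContext class itself. The later sc.parallelize(data) call is then made on the class rather than on an instance, so Python treats the list as the self argument and raises the TypeError.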

In Databricks, you don't need to create the SparkContext yourself; it's automatically pre-defined as sc in Databricks notebooks. See https://docs.databricks.com/user-guide/getting-started.html#predefined-variables for a more complete list of pre-defined variables in Databricks.
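
The name shadowing described above can be reproduced without a cluster. The following is only a minimal sketch: StandInContext is a hypothetical stand-in class, not part of PySpark, and its parallelize simply returns a list. It shows the same call succeeding on an instance and failing once sc is rebound to the class. Note that Python 3 words the TypeError differently from the Python 2 message quoted in the question (it reports a missing positional argument instead of an unbound method).

```python
class StandInContext:
    """Hypothetical stand-in for SparkContext; not part of PySpark."""

    def parallelize(self, items):
        # The real method returns an RDD; a plain list is enough for this sketch.
        return list(items)


# A notebook predefines an *instance* under the name sc.
sc = StandInContext()
ok = sc.parallelize([1, 2, 3])  # bound method: self is supplied automatically

# Rebinding sc to the *class* mirrors `from pyspark import SparkContext as sc`.
sc = StandInContext
try:
    sc.parallelize([1, 2, 3])  # the list is taken as self, so items is missing
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(ok)
print(error_message)
```

In a notebook, deleting the `from pyspark import SparkContext as sc` line (and re-running in a fresh session so that sc refers to the predefined instance again) should be enough for sc.parallelize(data) to work.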

Upvotes: 1
