Reputation: 297
I've been testing a script in IPython notebooks running PySpark, and everything I've wanted to accomplish has worked well.
I've also run it without a notebook from the command line using pyspark, and it works.
I'm using version 1.3.1.
However, when I submit it as a job using spark-submit:
spark-submit --master local[*] myscript.py
I'm getting the following error:
x_map = rdd.map(lambda s: (s[1][1],s[1][3])).distinct().toDF().toPandas()
AttributeError: 'PipelinedRDD' object has no attribute 'toDF'
The beginning of my script looks like the following:
from pyspark import SparkContext
sc = SparkContext(appName="Whatever")
from pyspark.sql.types import *
from pyspark.sql import Row
import statsmodels.api as sm
import pandas as pd
import numpy as np
import sys
# [..] other Python modules
rdd = sc.textFile(input_file)
rdd = rdd.map(lambda line: (line.split(",")[1],[x for x in line.split(",")])).sortByKey()
x_map = rdd.map(lambda s: (s[1][1],s[1][3])).distinct().toDF().toPandas()
Upvotes: 2
Views: 2591
Reputation: 4927
As you can read in the documentation at http://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html:
When created, SQLContext adds a method called toDF to RDD, which can be used to convert an RDD into a DataFrame; it is a shorthand for SQLContext.createDataFrame().
So in order to use the toDF method on your RDDs, you need to create a SQLContext and initialize it with your SparkContext:
from pyspark.sql import SQLContext
...
sqlContext = SQLContext(sc)  # instantiating the SQLContext attaches toDF to RDDs
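For reference, here is a minimal sketch of how the beginning of the script from the question could look with this fix applied (input_file is a placeholder carried over from the original post, and the list comprehension is simplified to an equivalent split):
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="Whatever")
sqlContext = SQLContext(sc)  # must run before calling toDF on any RDD

rdd = sc.textFile(input_file)
rdd = rdd.map(lambda line: (line.split(",")[1], line.split(","))).sortByKey()

# toDF() now resolves; it is shorthand for sqlContext.createDataFrame(rdd)
x_map = rdd.map(lambda s: (s[1][1], s[1][3])).distinct().toDF().toPandas()
The notebook and pyspark shell sessions worked because they create a SQL context for you automatically; a script run through spark-submit has to do it explicitly.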
Upvotes: 5