Reputation: 297
I've been testing a script in IPython notebooks running PySpark, and everything I've wanted to accomplish has worked well.
I've also run it without a notebook from the command line using pyspark, and it works.
I'm using version 1.3.1.
However, when I submit it as a job using spark-submit:
spark-submit --master local[*] myscript.py
I'm getting the following error:
x_map = rdd.map(lambda s: (s[1][1],s[1][3])).distinct().toDF().toPandas()
AttributeError: 'PipelinedRDD' object has no attribute 'toDF'
The beginning of my script looks like the following:
from pyspark import SparkContext
sc = SparkContext(appName="Whatever")
from pyspark.sql.types import *
from pyspark.sql import Row
import statsmodels.api as sm
import pandas as pd
import numpy as np
import sys
# [..] other Python modules
rdd = sc.textFile(input_file)
rdd = rdd.map(lambda line: (line.split(",")[1],[x for x in line.split(",")])).sortByKey()
x_map = rdd.map(lambda s: (s[1][1],s[1][3])).distinct().toDF().toPandas()
Upvotes: 2
Views: 2591
Reputation: 4927
As you can read in the documentation at http://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html:
When created, SQLContext adds a method called toDF to RDD, which can be used to convert an RDD into a DataFrame; it is a shorthand for SQLContext.createDataFrame().
So in order to use the toDF method on your RDDs, you need to create a SQLContext and initialize it with your SparkContext:
from pyspark.sql import SQLContext
...
sqlContext = SQLContext(sc)  # instantiating the SQLContext attaches toDF to RDDs
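For reference, here is a minimal sketch of how the beginning of the script from the question could look with this fix applied (input_file is a placeholder carried over from the original post, and the list comprehension is simplified to an equivalent split):
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="Whatever")
sqlContext = SQLContext(sc)  # must run before calling toDF on any RDD

rdd = sc.textFile(input_file)
rdd = rdd.map(lambda line: (line.split(",")[1], line.split(","))).sortByKey()

# toDF() now resolves; it is shorthand for sqlContext.createDataFrame(rdd)
x_map = rdd.map(lambda s: (s[1][1], s[1][3])).distinct().toDF().toPandas()
The notebook and pyspark shell sessions worked because they create a SQL context for you automatically; a script run through spark-submit has to do it explicitly.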
Upvotes: 5