bsky

Reputation: 20242

Error when creating DataFrame from RDD

I have the following code where I am trying to create a DataFrame from a `PipelinedRDD`:

  print type(simulation)
  sqlContext.createDataFrame(simulation)

The print statement prints this:

<class 'pyspark.rdd.PipelinedRDD'>

However, on the next line I am getting this error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):

The error has this trace:

---> 13   sqlContext.createDataFrame(simulation)

/databricks/spark/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio)
    421 
    422         if isinstance(data, RDD):
--> 423             rdd, schema = self._createFromRDD(data, schema, samplingRatio)
    424         else:
    425             rdd, schema = self._createFromLocal(data, schema)

/databricks/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, schema, samplingRatio)
    308         """
    309         if schema is None or isinstance(schema, (list, tuple)):
--> 310             struct = self._inferSchema(rdd, samplingRatio)

Upvotes: 1

Views: 2769

Answers (1)

Sorin

Reputation: 910

It seems that the schema cannot be inferred from your data. If you don't pass a samplingRatio, only the first row is used to determine the column types, so a null or unexpected value in that row makes inference fail. Either pass a non-zero samplingRatio or specify a schema explicitly, for example:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([StructField("int_field", IntegerType()),
                     StructField("string_field", StringType())])

Upvotes: 2
