bsky

Reputation: 20242

Error when creating DataFrame from RDD

I have the following code where I am trying to create a DataFrame from a `PipelinedRDD`:

  print type(simulation)
  sqlContext.createDataFrame(simulation)

The print statement prints this:

<class 'pyspark.rdd.PipelinedRDD'>

However, on the next line I am getting this error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):

The error has this trace:

---> 13   sqlContext.createDataFrame(simulation)

/databricks/spark/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio)
    421 
    422         if isinstance(data, RDD):
--> 423             rdd, schema = self._createFromRDD(data, schema, samplingRatio)
    424         else:
    425             rdd, schema = self._createFromLocal(data, schema)

/databricks/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, schema, samplingRatio)
    308         """
    309         if schema is None or isinstance(schema, (list, tuple)):
--> 310             struct = self._inferSchema(rdd, samplingRatio)

Upvotes: 1

Views: 2769

Answers (1)

Sorin

Reputation: 910

It seems that the schema cannot be inferred from your data. If you don't pass a samplingRatio, only the first row is used to determine the column types, so a null or unexpected value in that row makes inference fail. Either pass a non-zero samplingRatio or specify a schema explicitly, for example:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([StructField("int_field", IntegerType()),
                     StructField("string_field", StringType())])

Upvotes: 2
