Reputation: 20242
I have the following code where I am trying to create a DataFrame from a PipelinedRDD:

print type(simulation)
sqlContext.createDataFrame(simulation)
The print statement prints this:
<class 'pyspark.rdd.PipelinedRDD'>
However, the next line raises this error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
The error has this trace:
---> 13 sqlContext.createDataFrame(simulation)
/databricks/spark/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio)
421
422 if isinstance(data, RDD):
--> 423 rdd, schema = self._createFromRDD(data, schema, samplingRatio)
424 else:
425 rdd, schema = self._createFromLocal(data, schema)
/databricks/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, schema, samplingRatio)
308 """
309 if schema is None or isinstance(schema, (list, tuple)):
--> 310 struct = self._inferSchema(rdd, samplingRatio)
Upvotes: 1
Views: 2769
Reputation: 910
It seems that the schema cannot be inferred from your data. If you don't specify samplingRatio, only the first row is used to determine the field types, and inference fails when that row contains nulls or ambiguous values. Either pass a non-zero samplingRatio so more rows are sampled, or specify the schema explicitly, for example:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([StructField("int_field", IntegerType()),
                     StructField("string_field", StringType())])
Upvotes: 2