Reputation: 1156
In PySpark (v1.6.2), when converting an RDD to a DataFrame with a specified schema, fields whose value type doesn't match the one declared in the schema get converted to null:
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, DoubleType
sc = SparkContext()
sqlContext = SQLContext(sc)
schema = StructType([
    StructField("foo", DoubleType(), nullable=False)
])
rdd = sc.parallelize([{"foo": 1}])
df = sqlContext.createDataFrame(rdd, schema=schema)
df.show()
+----+
| foo|
+----+
|null|
+----+
Is this a PySpark bug, or just very surprising but intended behaviour? I would expect either a TypeError to be raised, or the int to be converted to a float compatible with DoubleType.
Upvotes: 0
Views: 622
Reputation: 330093
This is intended behavior. In particular, see the comments in the corresponding part of the source:
// all other unexpected type should be null, or we will have runtime exception
// TODO(davies): we could improve this by try to cast the object to expected type
case (c, _) => null
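As a rough workaround sketch (reusing sc, sqlContext and schema from the question, and assuming you control the RDD before the conversion), you can cast the values to the Python type matching the declared field type before calling createDataFrame, e.g. float for DoubleType:
# Sketch: cast the value to float so it matches DoubleType
# instead of being silently converted to null.
rdd = sc.parallelize([{"foo": 1}]).map(lambda row: {"foo": float(row["foo"])})
df = sqlContext.createDataFrame(rdd, schema=schema)
df.show()
# +---+
# |foo|
# +---+
# |1.0|
# +---+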
Upvotes: 3