domkck

Reputation: 1156

PySpark SQLContext.createDataFrame producing nulls when declared and actual field types don't match

In PySpark (v1.6.2), when converting an RDD to a DataFrame with a specified schema, fields whose value type doesn't match the one declared in the schema get converted to null.

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, DoubleType

sc = SparkContext()
sqlContext = SQLContext(sc)

schema = StructType([
    StructField("foo", DoubleType(), nullable=False)
])

rdd = sc.parallelize([{"foo": 1}])
df = sqlContext.createDataFrame(rdd, schema=schema)

df.show()

+----+
| foo|
+----+
|null|
+----+

Is this a PySpark bug, or just very surprising but intended behaviour? I would expect either a TypeError to be raised or the int to be converted to a float compatible with DoubleType.

Upvotes: 0

Views: 622

Answers (1)

zero323

Reputation: 330093

This is intended behavior. In particular, see the comments in the corresponding part of the source:

// all other unexpected type should be null, or we will have runtime exception
// TODO(davies): we could improve this by try to cast the object to expected type
case (c, _) => null
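A minimal sketch of one possible workaround (not part of the original answer, and using the rdd, schema and sqlContext names from the question's snippet): cast the value to a Python float before applying the schema, so the value type matches the declared DoubleType and the silent null conversion never happens.

rdd = sc.parallelize([{"foo": 1}])

# Cast each value to float so it matches DoubleType in the schema;
# otherwise the conversion silently produces null.
rdd_as_double = rdd.map(lambda row: {"foo": float(row["foo"])})

df = sqlContext.createDataFrame(rdd_as_double, schema=schema)
df.show()

+---+
|foo|
+---+
|1.0|
+---+

Alternatively, declare the field as LongType (or IntegerType) in the schema if the values really are ints.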

Upvotes: 3
