Reputation: 31
With PySpark I'm trying to convert an RDD of nested dicts into a DataFrame, but I'm losing data in some fields, which are set to null.
Here's the code:
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext()
sqlContext = SQLContext(sc)

def convert_to_row(d):
    return Row(**d)

df2 = sc.parallelize([{"id": "14yy74hwogxoyl2l3v", "geoloc": {"country": {"geoname_id": 3017382, "iso_code": "FR", "name": "France"}}}]).map(convert_to_row).toDF()
df2.printSchema()
df2.show()
df2.toJSON().saveAsTextFile("/tmp/json.test")
When I have a look at /tmp/json.test, here's the content (after manual indentation):
{
  "geoloc": {
    "country": {
      "name": null,
      "iso_code": null,
      "geoname_id": 3017382
    }
  },
  "id": "14yy74hwogxoyl2l3v"
}
iso_code and name have been converted to null.
Can anyone help me with this? I can't understand it.
I'm using Python 2.7 and Spark 2.0.0.
Thanks!
Upvotes: 2
Views: 2409
Reputation: 60390
Following the explanation already provided by @user6910411 (and saving me the time to do it myself), the remedy (i.e. the intermediate JSON representation) is to use read.json instead of toDF and your function:
spark.version
# u'2.0.2'
jsonRDD = sc.parallelize([{"id": "14yy74hwogxoyl2l3v", "geoloc": {"country": {"geoname_id": 3017382, "iso_code": "FR", "name": "France"}}}])
df = spark.read.json(jsonRDD)
df.collect()
# result:
[Row(geoloc=Row(country=Row(geoname_id=3017382, iso_code=u'FR', name=u'France')), id=u'14yy74hwogxoyl2l3v')]
# just to have a look at what will be saved:
df.toJSON().collect()
# result:
[u'{"geoloc":{"country":{"geoname_id":3017382,"iso_code":"FR","name":"France"}},"id":"14yy74hwogxoyl2l3v"}']
df.toJSON().saveAsTextFile("/tmp/json.test")
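As a quick sanity check, the saved output can be read back with the JSON data source:

spark.read.json("/tmp/json.test").collect()
# the nested fields should survive the round trip and match df.collect() above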
For comparison, here is how your own df2 looks:
df2.collect()
# result:
[Row(geoloc={u'country': {u'geoname_id': 3017382, u'iso_code': None, u'name': None}}, id=u'14yy74hwogxoyl2l3v')]
df2.toJSON().collect()
# result:
[u'{"geoloc":{"country":{"name":null,"iso_code":null,"geoname_id":3017382}},"id":"14yy74hwogxoyl2l3v"}']
Upvotes: 1
Reputation: 330393
This happens because you don't use Row correctly. The Row constructor is not recursive and operates only on the top-level fields. When you take a look at the schema:
root
 |-- geoloc: map (nullable = true)
 |    |-- key: string
 |    |-- value: map (valueContainsNull = true)
 |    |    |-- key: string
 |    |    |-- value: long (valueContainsNull = true)
 |-- id: string (nullable = true)
you'll see that geoloc is represented as map<string,map<string,long>>. Since the inferred map value type is long, the string values of iso_code and name cannot be stored and are replaced with null. A correct representation of the structure would use nested Rows:
Row(
    id="14yy74hwogxoyl2l3v",
    geoloc=Row(
        country=Row(geoname_id=3017382, iso_code="FR", name="France")))
while what you pass is equivalent to:
Row(
    geoloc={'country':
        {'geoname_id': 3017382, 'iso_code': 'FR', 'name': 'France'}},
    id='14yy74hwogxoyl2l3v')
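For this specific shape, a minimal hand-rolled converter could look like the sketch below; convert_to_nested_row is a hypothetical helper that hard-codes the geoloc/country nesting:

def convert_to_nested_row(d):
    # hypothetical helper: builds nested Rows for this exact structure
    country = d["geoloc"]["country"]
    return Row(
        id=d["id"],
        geoloc=Row(country=Row(**country)))

data = {"id": "14yy74hwogxoyl2l3v",
        "geoloc": {"country": {"geoname_id": 3017382,
                               "iso_code": "FR", "name": "France"}}}
sc.parallelize([data]).map(convert_to_nested_row).toDF().printSchema()
# geoloc should now come out as a struct, with iso_code and name as strings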
Since creating a correct general implementation would have to cover a number of border cases, it makes more sense to use an intermediate JSON representation and the Spark JSON data source.
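A minimal sketch of that approach (assuming a Spark 2.x session exposed as spark, as in the other answer):

import json

data = [{"id": "14yy74hwogxoyl2l3v",
         "geoloc": {"country": {"geoname_id": 3017382,
                                "iso_code": "FR", "name": "France"}}}]

# serialize each dict to a JSON string and let the JSON data source
# infer the nested struct schema
df = spark.read.json(sc.parallelize(data).map(json.dumps))
df.printSchema()
# root
#  |-- geoloc: struct (nullable = true)
#  |    |-- country: struct (nullable = true)
#  |    |    |-- geoname_id: long (nullable = true)
#  |    |    |-- iso_code: string (nullable = true)
#  |    |    |-- name: string (nullable = true)
#  |-- id: string (nullable = true)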
Upvotes: 2