Stas Lambant

Reputation: 33

spark.createDataFrame() returns a 'NoneType' object

I've parsed a nested JSON file using the following code:

df = df.select(df.data.attributes.signatures_by_constituency.ons_code.alias("wc_code"),
               df.data.attributes.signatures_by_constituency.signature_count.alias("sign_count"))

This results in the following dataframe with values sitting in arrays.

+--------------------+--------------------+
|             wc_code|          sign_count|
+--------------------+--------------------+
|[E14000530, E1400...|[28, 6, 17, 15, 5...|
+--------------------+--------------------+

The next step is to unpack the arrays into one row per element, for which I'm using explode() as follows:

df1 = spark.createDataFrame(df.withColumn("wc_count", F.explode(F.arrays_zip("wc_code", "sign_count")))\
    .select("wc_count.wc_code", "wc_count.sign_count")\
    .show()
    )

which results in the following output and error:

+---------+----------+
|  wc_code|sign_count|
+---------+----------+
|E14000530|        28|
|E14000531|         6|
|E14000532|        17|
|E14000533|        15|
|E14000534|        54|
|E14000535|        12|
|E14000536|        34|
|E14000537|        10|
|E14000538|        32|
|E14000539|        29|
|E14000540|         3|
|E14000541|        10|
|E14000542|         8|
|E14000543|        13|
|E14000544|        15|
|E14000545|        19|
|E14000546|         8|
|E14000547|        28|
|E14000548|         7|
|E14000549|        13|
+---------+----------+
only showing top 20 rows

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-43-16ce85050366> in <module>
      2 
      3 df1 = spark.createDataFrame(df.withColumn("wc_count", F.explode(F.arrays_zip("wc_code", "sign_count")))\
----> 4     .select("wc_count.wc_code", "wc_count.sign_count")\
      5     .show()
      6                            )

/usr/local/spark/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
    305         Py4JJavaError: ...
    306         """
--> 307         return self.sparkSession.createDataFrame(data, schema, samplingRatio, verifySchema)
    308 
    309     @since(1.3)

/usr/local/spark/python/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
    746             rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
    747         else:
--> 748             rdd, schema = self._createFromLocal(map(prepare, data), schema)
    749         jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
    750         jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())

TypeError: 'NoneType' object is not iterable

Note that the table is still displayed before the error is raised.

I don't get the error if I simply use df1 = ..., but no matter what I do with the variable afterwards, it errors out with 'NoneType' object has no attribute 'something'.

If I don't assign the result to a variable and don't use df1 = spark.createDataFrame(), I don't get this error, so I'm guessing something breaks when the variable gets created.
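As a sanity check, here is a minimal sketch inspecting what the inner expression actually evaluates to (assuming F is pyspark.sql.functions, as in the snippets above):

import pyspark.sql.functions as F

# .show() prints the rows as a side effect and returns None,
# so `result` ends up as NoneType rather than a DataFrame.
result = df.withColumn("wc_count", F.explode(F.arrays_zip("wc_code", "sign_count"))) \
    .select("wc_count.wc_code", "wc_count.sign_count") \
    .show()

print(type(result))  # <class 'NoneType'>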

In case it's important, df.printSchema() produces the following:

root
 |-- wc_code: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- sign_count: array (nullable = true)
 |    |-- element: long (containsNull = true)

Upvotes: 0

Views: 1848

Answers (2)

Stas Lambant

Reputation: 33

I think I've fixed the issue.

Running the code like this seems to have fixed it:

df1 = df.withColumn("wc_count", F.explode(F.arrays_zip("wc_code", "sign_count")))\
    .select("wc_count.wc_code", "wc_count.sign_count")

For some reason, calling .show() at the end of it was messing with the newly created dataframe. I'd be curious to know if there's a valid reason for this or if it's a bug.
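The reason appears to be that .show() is an action, not a transformation: it prints the rows and returns None, so the original code was passing None into spark.createDataFrame(), which then failed with 'NoneType' object is not iterable. A minimal sketch of the working pattern, keeping the transformation and the display separate:

import pyspark.sql.functions as F

# Transformations return a new DataFrame...
df1 = df.withColumn("wc_count", F.explode(F.arrays_zip("wc_code", "sign_count"))) \
    .select("wc_count.wc_code", "wc_count.sign_count")

# ...while show() is an action that prints and returns None,
# so call it on its own line rather than assigning its result.
df1.show()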

Upvotes: 0

Som

Reputation: 6338

Can you try the query below?

val df = spark.sql("select array('E14000530', 'E1400') as wc_code, array(28, 6, 17, 15) as sign_count")
df.selectExpr("inline_outer(arrays_zip(wc_code, sign_count)) as (wc_code, sign_count)").show()

/**
  * +---------+----------+
  * |  wc_code|sign_count|
  * +---------+----------+
  * |E14000530|        28|
  * |    E1400|         6|
  * |     null|        17|
  * |     null|        15|
  * +---------+----------+
  */
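For completeness, the same approach works from PySpark (a sketch, applied to the df from the question): inline_outer() explodes an array of structs into rows, producing one column per struct field, and selectExpr() accepts the same SQL expression in Python:

# PySpark equivalent of the Scala snippet above
df.selectExpr(
    "inline_outer(arrays_zip(wc_code, sign_count)) as (wc_code, sign_count)"
).show()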

Upvotes: 1
