Reputation: 33
I've parsed a nested JSON file using the following code:
df = df.select(df.data.attributes.signatures_by_constituency.ons_code.alias("wc_code"),
               df.data.attributes.signatures_by_constituency.signature_count.alias("sign_count"))
This results in the following DataFrame, with the values sitting in arrays:
+--------------------+--------------------+
| wc_code| sign_count|
+--------------------+--------------------+
|[E14000530, E1400...|[28, 6, 17, 15, 5...|
+--------------------+--------------------+
The next step is to unpack the arrays into columns, for which I'm using explode() as follows:
df1 = spark.createDataFrame(df.withColumn("wc_count", F.explode(F.arrays_zip("wc_code", "sign_count")))\
    .select("wc_count.wc_code", "wc_count.sign_count")\
    .show()
)
which displays the table but then throws the following error:
+---------+----------+
| wc_code|sign_count|
+---------+----------+
|E14000530| 28|
|E14000531| 6|
|E14000532| 17|
|E14000533| 15|
|E14000534| 54|
|E14000535| 12|
|E14000536| 34|
|E14000537| 10|
|E14000538| 32|
|E14000539| 29|
|E14000540| 3|
|E14000541| 10|
|E14000542| 8|
|E14000543| 13|
|E14000544| 15|
|E14000545| 19|
|E14000546| 8|
|E14000547| 28|
|E14000548| 7|
|E14000549| 13|
+---------+----------+
only showing top 20 rows
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-43-16ce85050366> in <module>
2
3 df1 = spark.createDataFrame(df.withColumn("wc_count", F.explode(F.arrays_zip("wc_code", "sign_count")))\
----> 4 .select("wc_count.wc_code", "wc_count.sign_count")\
5 .show()
6 )
/usr/local/spark/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
305 Py4JJavaError: ...
306 """
--> 307 return self.sparkSession.createDataFrame(data, schema, samplingRatio, verifySchema)
308
309 @since(1.3)
/usr/local/spark/python/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
746 rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
747 else:
--> 748 rdd, schema = self._createFromLocal(map(prepare, data), schema)
749 jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
750 jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
TypeError: 'NoneType' object is not iterable
Note that the table is displayed.
I don't get the error if I simply use df1 = ..., but no matter what I do with the variable afterwards it errors out with 'NoneType' object has no attribute 'something'.
If I don't assign the result to a variable and don't use df1 = spark.createDataFrame(), I don't get this error, so I'm guessing something breaks when the variable gets created.
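For what it's worth, here is a minimal, self-contained version of the same pattern on a made-up two-column DataFrame (the values below are invented, not my real data); it should hit the same TypeError, which suggests the data itself isn't the problem:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Toy DataFrame with the same array-typed columns as above (values are made up)
toy = spark.createDataFrame(
    [(["E14000530", "E14000531"], [28, 6])],
    ["wc_code", "sign_count"],
)

# Same pattern as above: the chain ends in .show(), and its result is passed to createDataFrame()
df1 = spark.createDataFrame(
    toy.withColumn("wc_count", F.explode(F.arrays_zip("wc_code", "sign_count")))
       .select("wc_count.wc_code", "wc_count.sign_count")
       .show()
)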
In case it's important, df.printSchema() produces the following:
root
|-- wc_code: array (nullable = true)
| |-- element: string (containsNull = true)
|-- sign_count: array (nullable = true)
| |-- element: long (containsNull = true)
Upvotes: 0
Views: 1848
Reputation: 33
I think I've fixed the issue.
Running the code like this seems to have fixed it:
df1 = df.withColumn("wc_count", F.explode(F.arrays_zip("wc_code", "sign_count")))\
    .select("wc_count.wc_code", "wc_count.sign_count")
For some reason, calling .show() at the end was messing with the newly created DataFrame; presumably .show() only prints the table and returns None, so df1 was ending up as None instead of a DataFrame. Would be curious to know if there's more to it than that, or if it's a bug.
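Displaying the result is now a separate step:
# Display the table separately now that df1 holds a real DataFrame
df1.show()
# +---------+----------+
# |  wc_code|sign_count|
# +---------+----------+
# |E14000530|        28|
# |E14000531|         6|
# ...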
Upvotes: 0
Reputation: 6338
Can you try the query below?
val df = spark.sql("select array('E14000530', 'E1400') as wc_code, array(28, 6, 17, 15) as sign_count")
df.selectExpr("inline_outer(arrays_zip(wc_code, sign_count)) as (wc_code, sign_count)").show()
/**
* +---------+----------+
* | wc_code|sign_count|
* +---------+----------+
* |E14000530| 28|
* | E1400| 6|
* | null| 17|
* | null| 15|
* +---------+----------+
*/
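The same thing should work from PySpark via selectExpr; this is a direct translation of the Scala snippet above (not separately tested):
# Build the same sample data and flatten the zipped arrays into columns with inline_outer
df = spark.sql("select array('E14000530', 'E1400') as wc_code, array(28, 6, 17, 15) as sign_count")
df.selectExpr("inline_outer(arrays_zip(wc_code, sign_count)) as (wc_code, sign_count)").show()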
Upvotes: 1