Reputation: 643
I have a PySpark DataFrame df1. Its printSchema() output is shown below.
df1.printSchema()
root
|-- parent: struct (nullable = true)
| |-- childa: struct (nullable = true)
| | |-- x: string (nullable = true)
| | |-- y: string (nullable = true)
| | |-- z: string (nullable = true)
| |-- childb: struct (nullable = true)
| | |-- x: string (nullable = true)
| | |-- y: string (nullable = true)
| | |-- z: string (nullable = true)
| |-- childc: struct (nullable = true)
| | |-- x: string (nullable = true)
| | |-- y: string (nullable = true)
| | |-- z: string (nullable = true)
| |-- childd: struct (nullable = true)
| | |-- x: string (nullable = true)
| | |-- y: string (nullable = true)
| | |-- z: string (nullable = true)
df1.show(10, False)
+--------------------------------------------------------------+
|parent                                                        |
+--------------------------------------------------------------+
|[,[x_value, y_value, z_value], ,[x_value, y_value, z_value]]  |
+--------------------------------------------------------------+
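For reference, a DataFrame matching this schema can be built with an explicit StructType. The construction below is a hypothetical reproduction (the sample values are made up), with childa and childc null while childb and childd are populated:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# Each child is a struct of three nullable strings
child = StructType([StructField(c, StringType()) for c in ('x', 'y', 'z')])
schema = StructType([
    StructField('parent', StructType([
        StructField(name, child) for name in ('childa', 'childb', 'childc', 'childd')
    ]))
])

# One row: childa and childc are null, childb and childd hold values
df1 = spark.createDataFrame(
    [((None, ('x_value', 'y_value', 'z_value'),
       None, ('x_value', 'y_value', 'z_value')),)],
    schema,
)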
The df1.show() output shows that childb and childd are not null.
The approach below gives me all the child struct field names in a list, which covers my first requirement:
spark.sql("""select parent.* from df1""").schema.fieldNames()
Output:
['childa', 'childb', 'childc', 'childd']
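Note that spark.sql can only resolve df1 if the DataFrame was registered as a temporary view beforehand; the same list can also be obtained through the DataFrame API directly:

df1.createOrReplaceTempView('df1')   # needed before spark.sql can reference df1

# equivalent without SQL:
df1.select('parent.*').schema.fieldNames()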
Now I want to get only those child struct field names that are not null; I am expecting only childb and childd in the list.
Expected output: ['childb', 'childd']
Upvotes: 1
Views: 1016
Reputation: 42352
You can check whether each field is null using a filter and a count:
non_null_fields = [
    field
    for field in df1.select('parent.*').schema.fieldNames()
    # keep a field only if no row has a null value for it
    if df1.filter('parent.%s is null' % field).count() == 0
]
which gives
['childb', 'childd']
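Note that this runs one filter-and-count job per child field. If the struct had many children, a single-pass variant that aggregates all the null counts in one job could be used instead; a sketch against the same df1:

from pyspark.sql import functions as F

fields = df1.select('parent.*').schema.fieldNames()
# count the null occurrences of every child in a single pass
null_counts = df1.select([
    F.count(F.when(F.col('parent.%s' % f).isNull(), 1)).alias(f)
    for f in fields
]).first()
non_null_fields = [f for f in fields if null_counts[f] == 0]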
Upvotes: 1