Get only those field names which are not null

Question

I have a PySpark dataframe df1. Its printSchema() output is shown below.

df1.printSchema()

root
 |-- parent: struct (nullable = true)
 |    |-- childa: struct (nullable = true)
 |    |    |-- x: string (nullable = true)
 |    |    |-- y: string (nullable = true)
 |    |    |-- z: string (nullable = true)
 |    |-- childb: struct (nullable = true)
 |    |    |-- x: string (nullable = true)
 |    |    |-- y: string (nullable = true)
 |    |    |-- z: string (nullable = true)
 |    |-- childc: struct (nullable = true)
 |    |    |-- x: string (nullable = true)
 |    |    |-- y: string (nullable = true)
 |    |    |-- z: string (nullable = true)
 |    |-- childd: struct (nullable = true)
 |    |    |-- x: string (nullable = true)
 |    |    |-- y: string (nullable = true)
 |    |    |-- z: string (nullable = true)

df1.show(10,False)

----------------------------------------------------------------
|parent                                                        |
----------------------------------------------------------------
|[,[x_value, y_value, z_value], ,[x_value, y_value, z_value]]  |
----------------------------------------------------------------

The df1.show() shows that childb and childd are not null.

I am able to get all the child struct field names (childa, childb, childc, childd).
I also want to get only those child struct field names which are not null.

The approach below gives me all the child struct field names in a list, which covers the first requirement.

spark.sql("""select parent.* from df1""").schema.fieldNames()
Output:
[childa, childb, childc, childd]

Now I want to get only those child struct field names which are not null. I expect a list containing only childb and childd.

Expected Output: [childb, childd]

mck · Accepted Answer

You can check whether each field is null using a filter and a count, keeping only the fields for which no row has a null value:

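# Keep a child field only if no row of df1 has a null in it.
# Note: this runs one Spark job (filter + count) per child field.
non_null_fields = [
    field
    for field in df1.select('parent.*').schema.fieldNames()
    if df1.filter('parent.%s is null' % field).count() == 0
]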

which gives

['childb', 'childd']
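
The loop above launches one count per child field. If the struct has many fields, a possible variation is to count the nulls for all fields in one aggregation. The following is only a sketch under that assumption: the helper names null_count_exprs and non_null_fields are hypothetical (they are not part of PySpark), and it relies only on DataFrame.select, DataFrame.selectExpr and first().

```python
def null_count_exprs(field_names, parent="parent"):
    """Build one SQL aggregate per child field that counts the rows where it is null."""
    return [
        "count(CASE WHEN `{p}`.`{f}` IS NULL THEN 1 END) AS `{f}`".format(p=parent, f=name)
        for name in field_names
    ]


def non_null_fields(df, parent="parent"):
    """Return the child fields of `parent` that are null in no row (hypothetical helper)."""
    names = df.select(parent + ".*").schema.fieldNames()
    if not names:
        return []
    # A single aggregation yields one row holding the null count of every child field.
    counts = df.selectExpr(*null_count_exprs(names, parent)).first()
    return [name for name in names if counts[name] == 0]
```

Calling non_null_fields(df1) on the dataframe from the question should give the same list as the loop, since both keep a field only when its null count is zero. As with the loop, an empty dataframe reports every field as non-null.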
