Reputation: 1
I have this StructType with over 1,000 fields; every field is a string.
root
|-- mac: string (nullable = true)
|-- kv: struct (nullable = true)
| |-- FTP_SERVER_ANAUTHORIZED_FEAT_B64: string (nullable = true)
| |-- FTP_SERVER_ANAUTHORIZED_FEAT_CODE: string (nullable = true)
| |-- FTP_SERVER_ANAUTHORIZED_HELP_B64: string (nullable = true)
| |-- FTP_SERVER_ANAUTHORIZED_HELP_CODE: string (nullable = true)
| |-- FTP_SERVER_ANAUTHORIZED_SYST_B64: string (nullable = true)
| |-- FTP_SERVER_ANAUTHORIZED_SYST_CODE: string (nullable = true)
| |-- FTP_SERVER_HELLO_B64: string (nullable = true)
| |-- FTP_STATUS_HELLO_CODE: string (nullable = true)
| |-- HTML_LOGIN_FORM_ACTION_0: string (nullable = true)
| |-- HTML_LOGIN_FORM_DETECTION_0: string (nullable = true)
| |-- HTML_LOGIN_FORM_INPUT_PASSWORD_NAME_0: string (nullable = true)
| |-- HTML_LOGIN_FORM_INPUT_TEXT_NAME_0: string (nullable = true)
| |-- HTML_LOGIN_FORM_METHOD_0: string (nullable = true)
| |-- HTML_REDIRECT_TYPE_0: string (nullable = true)
I want to select only the fields that are non-null, along with some identifier of which fields those are. Is there any way to convert this struct to an array without explicitly referring to each element?
Upvotes: 0
Views: 820
Reputation: 35229
I'd use a UDF:
from pyspark.sql.types import ArrayType, StringType
from pyspark.sql.functions import udf

# Iterate over the struct's Row and keep only the non-null values
as_array = udf(
    lambda row: [x for x in row if x is not None],
    ArrayType(StringType()))

df.withColumn("arr", as_array(df["kv"]))
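If you also need the identifier of which fields are non-null, a similar UDF can return the matching field names instead of the values. A minimal sketch, assuming the struct column arrives as a Row (so asDict() is available); the name non_null_names is just for illustration:

from pyspark.sql.types import ArrayType, StringType
from pyspark.sql.functions import udf

# Return the names of the struct fields whose values are non-null
non_null_names = udf(
    lambda row: [k for k, v in row.asDict().items() if v is not None],
    ArrayType(StringType()))

df.withColumn("non_null_fields", non_null_names(df["kv"]))

Combined with the array of values above, each row then carries both the non-null values and the names of the fields they came from.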
Upvotes: 1