user2291165

Reputation: 1

Converting all fields in a StructType to an array

I have this StructType with over 1000 fields; every field is a string.

root
 |-- mac: string (nullable = true)
 |-- kv: struct (nullable = true)
 |    |-- FTP_SERVER_ANAUTHORIZED_FEAT_B64: string (nullable = true)
 |    |-- FTP_SERVER_ANAUTHORIZED_FEAT_CODE: string (nullable = true)
 |    |-- FTP_SERVER_ANAUTHORIZED_HELP_B64: string (nullable = true)
 |    |-- FTP_SERVER_ANAUTHORIZED_HELP_CODE: string (nullable = true)
 |    |-- FTP_SERVER_ANAUTHORIZED_SYST_B64: string (nullable = true)
 |    |-- FTP_SERVER_ANAUTHORIZED_SYST_CODE: string (nullable = true)
 |    |-- FTP_SERVER_HELLO_B64: string (nullable = true)
 |    |-- FTP_STATUS_HELLO_CODE: string (nullable = true)
 |    |-- HTML_LOGIN_FORM_ACTION_0: string (nullable = true)
 |    |-- HTML_LOGIN_FORM_DETECTION_0: string (nullable = true)
 |    |-- HTML_LOGIN_FORM_INPUT_PASSWORD_NAME_0: string (nullable = true)
 |    |-- HTML_LOGIN_FORM_INPUT_TEXT_NAME_0: string (nullable = true)
 |    |-- HTML_LOGIN_FORM_METHOD_0: string (nullable = true)
 |    |-- HTML_REDIRECT_TYPE_0: string (nullable = true)

I want to select only the fields which are non-null, plus some identifier of which fields are non-null. Is there any way to convert this struct to an array without explicitly referring to each element?

Upvotes: 0

Views: 820

Answers (1)

Alper t. Turker

Reputation: 35229

I'd use a UDF:

from pyspark.sql.types import ArrayType, StringType
from pyspark.sql.functions import udf

# The struct column arrives in the Python UDF as a Row; iterating over it
# yields the field values, so keep only the non-null ones.
as_array = udf(
    lambda kv: [x for x in kv if x is not None] if kv is not None else None,
    ArrayType(StringType()))

df.withColumn("arr", as_array(df["kv"]))

Upvotes: 1
