Reputation: 1599
I would like to explode a column to rows in a dataframe on pyspark hive. There are two columns in the dataframe. The column "business_id" is a string. The column "sports_info" is a struct type, each element value is an array of string.
Data:
business_id sports_info
"abc-123" {"sports_type":
["{sport_name:most_recent,
sport_events:[{sport_id:568, val:10.827},{id:171,score:8.61}]}"
]
}
I need to get a dataframe like:
business_id. sport_id
"abc-123" 568
"abc-123" 171
I defined:
schema = StructType([ \
StructField("sports_type",ArrayType(),True)
])
df = spark.createDataFrame(data=data, schema=schema) # I am not sure how to create the df
df.printSchema()
df.show(truncate=False)
def get_ids(val):
sports_type = 'sports_type'
sport_events = 'sport_events'
sport_id = 'sport_id'
sport_ids_vals = eval(val.sports_type[0])['sport_events']
ids = [s['sport_id'] for s in sport_ids_scores]
return ids
df2 = df.withColumn('sport_new', F.udf(lambda x: get_ids(x),
ArrayType(ArrayType(StringType())))('sports_info'))
How could I create the df and extract/explode the inner nested elements?
Upvotes: 1
Views: 132
Reputation: 5125
df2 = df.withColumn('sport_new', expr("transform (sports_type, x -> regexp_extract( x, 'sport_id:([0-9]+)',1))")).show()
Explained:
expr( #use a SQL expression, only way to access transform (pre spark 3)
"transform ( # run a SQL function on an array
sports_type, # declare column to use
x # declare the name of the variable to use for each element in the array
-> # Start writing SQL code to run on each element in the array
regexp_extract( # user SQL regex functions to pull out from the string
x, #string to run regex on
'sport_id:([0-9]+)',1))" # find sport_id and capture the number following it.
)
THis will likely run faster than a UDF as it can be vectorized.
Upvotes: 1