user3448011

Reputation: 1599

Extract and explode inner nested elements as rows from a string-encoded nested structure

I would like to explode a column into rows in a PySpark (Hive) DataFrame. The DataFrame has two columns: "business_id" is a string, and "sports_info" is a struct type whose field value is an array of strings.

Data:

business_id     sports_info
 "abc-123"       {"sports_type":
                   ["{sport_name:most_recent,
                      sport_events:[{sport_id:568, val:10.827}, {sport_id:171, score:8.61}]}"
                   ]
                 }

I need to get a dataframe like:

 business_id     sport_id
  "abc-123"       568
  "abc-123"       171

I defined:

from pyspark.sql.types import StructType, StructField, ArrayType, StringType

schema = StructType([
    StructField("sports_type", ArrayType(StringType()), True)
])

df = spark.createDataFrame(data=data, schema=schema)  # I am not sure how to create the df
df.printSchema()
df.show(truncate=False)

def get_ids(val):
    # parse the payload string and collect the sport_id of every event
    sport_events = eval(val.sports_type[0])['sport_events']
    ids = [s['sport_id'] for s in sport_events]
    return ids

df2 = df.withColumn('sport_new',
                    F.udf(lambda x: get_ids(x),
                          ArrayType(StringType()))('sports_info'))

How could I create the df and extract/explode the inner nested elements?

Upvotes: 1

Views: 132

Answers (1)

Matt Andruff

Reputation: 5125

df2 = df.withColumn('sport_new', expr("transform(sports_info.sports_type, x -> regexp_extract(x, 'sport_id:([0-9]+)', 1))"))
df2.show()

Explained:

expr( # use a SQL expression, the only way to access transform (pre Spark 3)
 "transform( # run a SQL higher-order function over an array
  sports_info.sports_type, # the array (a field of the sports_info struct) to transform
  x # the name of the variable bound to each element of the array
  -> # start of the SQL code to run on each element
  regexp_extract( # use the SQL regex function to pull a value out of the string
  x, # the string to run the regex on
  'sport_id:([0-9]+)', 1))" # find 'sport_id:' and capture the number that follows it
  )

This will likely run faster than a Python UDF because the whole expression is evaluated inside Spark's SQL engine instead of shipping each row to a Python worker.
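
For reference, here is a minimal, runnable sketch of the same idea (the column and field names follow the question: the sports_info struct holds a sports_type array of payload strings). Because both ids sit inside a single payload string and regexp_extract only returns the first match, this sketch swaps in regexp_extract_all (available from Spark 3.1) inside transform, then flattens and explodes the result to get one row per sport_id:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

spark = SparkSession.builder.getOrCreate()

# Schema and sample row taken from the question.
schema = StructType([
    StructField("business_id", StringType(), True),
    StructField("sports_info", StructType([
        StructField("sports_type", ArrayType(StringType()), True)
    ]), True),
])

payload = ("{sport_name:most_recent, "
           "sport_events:[{sport_id:568, val:10.827}, {sport_id:171, score:8.61}]}")
df = spark.createDataFrame([("abc-123", ([payload],))], schema=schema)

# transform() maps each payload string to the array of ids found in it,
# flatten() merges those per-string arrays, and explode() yields one row per id.
df2 = (df
       .withColumn("sport_id", F.explode(F.flatten(F.expr(
           "transform(sports_info.sports_type, "
           "x -> regexp_extract_all(x, 'sport_id:([0-9]+)', 1))"))))
       .select("business_id", "sport_id"))

df2.show()
# +-----------+--------+
# |business_id|sport_id|
# +-----------+--------+
# |    abc-123|     568|
# |    abc-123|     171|
# +-----------+--------+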

Upvotes: 1
