Reputation: 523
How can I modify the following code?
from pyspark.sql import functions as F

c1_arr = F.col('col1')
c2_arr = F.split(F.trim('col2'), r'\s+')
arr_of_struct = F.transform(
    c1_arr,
    lambda x: F.struct(
        F.size(F.array_intersect(c2_arr, F.split(F.trim(x), r'\s+'))).alias('cnt'),
        x.alias('val'),
    )
)
top_val = F.sort_array(arr_of_struct, False)[0]
top_val gives me the element of col1 that has the most tokens in common with col2's value.
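For reference, here is the logic I'm after, sketched in plain Python (illustration only, not Spark code; ties go to the first element here, whereas sort_array in descending order breaks ties by value):

```python
def best_match(col1, col2):
    # Tokenize col2 once, mirroring F.split(F.trim('col2'), r'\s+')
    c2_tokens = set(col2.strip().split())
    # Count shared tokens per element, like F.size(F.array_intersect(...))
    scored = [(len(c2_tokens & set(x.strip().split())), x) for x in col1]
    cnt, val = max(scored, key=lambda t: t[0])
    # F.when(top_val['cnt'] > 0, ...): no match when nothing overlaps
    return val if cnt > 0 else None

print(best_match(["summer is hot", "summer is too hot", "hot weather"], "hot summer"))
# prints: summer is hot
```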
Then I use the following code to get my desired result:
df = df.select(
    '*',
    F.when(top_val['cnt'] > 0, top_val['val']).alias('match')
)
For example consider the following data:
col1 col2
["come and get", "computer", "come and get more" ] "come for good"
["summer is hot", "summer is too hot", "hot weather"] "hot tea"
["summer is hot", "summer is too hot", "hot weather"] "hot summer"
Desired output:
col1 col2 match
["come and get", "computer", "come and get more" ] "come for good" "come and get"
["summer is hot", "summer is too hot", "hot weather"] "hot tea" "hot weather"
["summer is hot", "summer is too hot", "hot weather"] "hot summer" "summer is hot"
Now suppose I use the map_from_arrays function to join another column to col1. Then instead of col1 I have:
col1_modify
{"come and get":"this is one", "computer":"useful device" ,"come and get more": "sure things"}
{"summer is hot":"about season" ,"summer is too hot":"oh yea","hot weather": "tempters"}
{"summer is hot":"about season","summer is too hot":"oh yea","hot weather":"tempters"}
I still want to match on the keys, but I want to create a new column containing the values:
output will be:
match match_value
"come and get" "this is one"
"hot weather" "tempters"
"summer is hot" "about season"
If I don't use the map function, the third column can look like the one below; the goal is then to find the right index:
col3
["this is one","useful device" , "sure things"]
["about season" ,"oh yea", "tempters"]
["about season","oh yea","tempters"]
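To make the intent concrete, here is the map-based lookup I'm after, sketched in plain Python (illustration only; Python dicts keep insertion order, matching the key order I'd get from map_from_arrays):

```python
def best_match_value(col1_map, col2):
    # Tokenize col2, then score the map's keys exactly as before
    c2_tokens = set(col2.strip().split())
    scored = [(len(c2_tokens & set(k.strip().split())), k) for k in col1_map]
    cnt, key = max(scored, key=lambda t: t[0])
    # Return the matched key together with its mapped value
    return (key, col1_map[key]) if cnt > 0 else (None, None)

print(best_match_value(
    {"summer is hot": "about season", "summer is too hot": "oh yea", "hot weather": "tempters"},
    "hot summer"))
# prints: ('summer is hot', 'about season')
```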
Upvotes: 1
Views: 189
Reputation: 9308
Assuming you have a dataframe like this:
data = [
    [["come and get", "computer", "come and get more"], "come for good", ["this is one", "useful device", "sure things"]],
    [["summer is hot", "summer is too hot", "hot weather"], "hot tea", ["about season", "oh yea", "tempters"]],
    [["summer is hot", "summer is too hot", "hot weather"], "hot summer", ["about season", "oh yea", "tempters"]]
]
df = spark.createDataFrame(data, ['col1', 'col2', 'col3'])
First, I use your method to get the number of intersections. Then I use array_max and array_position to get the index with the most intersections.
*PySpark's array_position only accepts a static value for the second argument, so I had to use a SQL expression.
c1_arr = F.col('col1')
c2_arr = F.split(F.trim('col2'), r'\s+')
df = (df.withColumn('cnt', F.transform(c1_arr, lambda x: F.size(F.array_intersect(c2_arr, F.split(F.trim(x), r'\s+')))))
      .withColumn('best_match', F.expr('array_position(cnt, array_max(cnt))'))
      .select('col1', 'col2', 'col3',
              F.col('col1').getItem(F.col('best_match') - 1).alias('match'),
              F.col('col3').getItem(F.col('best_match') - 1).alias('match_value'))
)
(array_position is 1-indexed, so I subtract 1 when using it with getItem. array_position returns the first matching value.)
Upvotes: 1