Reputation: 523
How can I modify the following code?
from pyspark.sql import functions as F

c1_arr = F.col('col1')
c2_arr = F.split(F.trim('col2'), r'\s+')
arr_of_struct = F.transform(
    c1_arr,
    lambda x: F.struct(
        F.size(F.array_intersect(c2_arr, F.split(F.trim(x), r'\s+'))).alias('cnt'),
        x.alias('val'),
    )
)
top_val = F.sort_array(arr_of_struct, False)[0]
top_val gives me the element of col1 that has the most tokens in common with col2's value.
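For reference, here is the logic I'm after, sketched in plain Python (illustration only, not Spark code; ties go to the first element here, whereas sort_array in descending order breaks ties by value):

```python
def best_match(col1, col2):
    # Tokenize col2 once, mirroring F.split(F.trim('col2'), r'\s+')
    c2_tokens = set(col2.strip().split())
    # Count shared tokens per element, like F.size(F.array_intersect(...))
    scored = [(len(c2_tokens & set(x.strip().split())), x) for x in col1]
    cnt, val = max(scored, key=lambda t: t[0])
    # F.when(top_val['cnt'] > 0, ...): no match when nothing overlaps
    return val if cnt > 0 else None

print(best_match(["summer is hot", "summer is too hot", "hot weather"], "hot summer"))
# prints: summer is hot
```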
Then I use the following code to get my desired result:
df = df.select(
    '*',
    F.when(top_val['cnt'] > 0, top_val['val']).alias('match')
)
For example consider the following data:
col1 col2
["come and get", "computer", "come and get more" ] "come for good"
["summer is hot", "summer is too hot", "hot weather"] "hot tea"
["summer is hot", "summer is too hot", "hot weather"] "hot summer"
Desired output:
col1 col2 match
["come and get", "computer", "come and get more" ] "come for good" "come and get"
["summer is hot", "summer is too hot", "hot weather"] "hot tea" "hot weather"
["summer is hot", "summer is too hot", "hot weather"] "hot summer" "summer is hot"
Now suppose I use the map_from_arrays function to join another column to col1. Then instead of col1 I have:
col1_modify
{"come and get":"this is one", "computer":"useful device" ,"come and get more": "sure things"}
{"summer is hot":"about season" ,"summer is too hot":"oh yea","hot weather": "tempters"}
{"summer is hot":"about season","summer is too hot":"oh yea","hot weather":"tempters"}
I still want to match on the keys, but I want to create a new column containing the values:
output will be:
match match_value
"come and get" "this is one"
"hot weather" "tempters"
"summer is hot" "about season"
If I don't use the map function, the third column can look like the one below; the goal is then to find the right index:
col3
["this is one","useful device" , "sure things"]
["about season" ,"oh yea", "tempters"]
["about season","oh yea","tempters"]
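To make the intent concrete, here is the map-based lookup I'm after, sketched in plain Python (illustration only; Python dicts keep insertion order, matching the key order I'd get from map_from_arrays):

```python
def best_match_value(col1_map, col2):
    # Tokenize col2, then score the map's keys exactly as before
    c2_tokens = set(col2.strip().split())
    scored = [(len(c2_tokens & set(k.strip().split())), k) for k in col1_map]
    cnt, key = max(scored, key=lambda t: t[0])
    # Return the matched key together with its mapped value
    return (key, col1_map[key]) if cnt > 0 else (None, None)

print(best_match_value(
    {"summer is hot": "about season", "summer is too hot": "oh yea", "hot weather": "tempters"},
    "hot summer"))
# prints: ('summer is hot', 'about season')
```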
Upvotes: 1
Views: 189
Reputation: 9308
Assuming you have a dataframe like this:
data = [
    [["come and get", "computer", "come and get more"], "come for good", ["this is one", "useful device", "sure things"]],
    [["summer is hot", "summer is too hot", "hot weather"], "hot tea", ["about season", "oh yea", "tempters"]],
    [["summer is hot", "summer is too hot", "hot weather"], "hot summer", ["about season", "oh yea", "tempters"]]
]
df = spark.createDataFrame(data, ['col1', 'col2', 'col3'])
First, I use your method to get the number of intersections. Then I use array_max and array_position to get the index with the most intersections.
*PySpark's array_position only accepts a static value for the second argument, so I had to use a SQL expression.
c1_arr = F.col('col1')
c2_arr = F.split(F.trim('col2'), r'\s+')
df = (df.withColumn('cnt', F.transform(c1_arr, lambda x: F.size(F.array_intersect(c2_arr, F.split(F.trim(x), r'\s+')))))
      .withColumn('best_match', F.expr('array_position(cnt, array_max(cnt))'))
      .select('col1', 'col2', 'col3',
              F.col('col1').getItem(F.col('best_match') - 1).alias('match'),
              F.col('col3').getItem(F.col('best_match') - 1).alias('match_value'))
)
(array_position is 1-indexed, so I subtract 1 when using it with getItem. array_position returns the first matching value.)
Upvotes: 1