Krzysztof Fajst

Reputation: 161

PySpark dataframe: filling a group based on an EXACT value

I have the following PySpark dataframe:

+-----------+------------------+
|search_term|keyword_match_type|
+-----------+------------------+
|        st1|             EXACT|
|        st1|               DSA|
|        st2|               DSA|
|        st2|               DSA|
|        st2|               DSA|
+-----------+------------------+
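For reference, a minimal sketch that recreates this dataframe (assuming an active SparkSession; the variable name spark is my own):

from pyspark.sql import SparkSession

# Hypothetical setup to reproduce the sample data shown above.
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("st1", "EXACT"), ("st1", "DSA"), ("st2", "DSA"), ("st2", "DSA"), ("st2", "DSA")],
    ["search_term", "keyword_match_type"],
)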

I want to add a new column to this dataframe so that it looks like this:

+-----------+------------------+-----------------+
|search_term|keyword_match_type|exact_match_exist|
+-----------+------------------+-----------------+
|        st1|             EXACT|             True|
|        st1|               DSA|             True|
|        st2|               DSA|            False|
|        st2|               DSA|            False|
|        st2|               DSA|            False|
+-----------+------------------+-----------------+

So if at least one EXACT exists in group st1 in the keyword_match_type column, the whole group st1 should be filled with True in the new column.

The code below works fine, but it does not apply the check per group:

from pyspark.sql.functions import when, col, lit

df.withColumn("exact_match_exist", when(col('keyword_match_type').isin("EXACT"), lit('True')).otherwise(lit('False'))).show()

But when I try

from pyspark.sql import Window

w = Window().partitionBy("search_term")
df.withColumn("exact_match_exist", when(col('keyword_match_type').over(w).isin("EXACT"), lit('True')).otherwise(lit('False'))).show()

I'm getting

AnalysisException: Expression 'keyword_match_type#25116' not supported within a window function.;

Any idea how to overcome this?

Upvotes: 1

Views: 272

Answers (2)

wwnde

Reputation: 26676

Another way: collect the group's values with collect_list over a window and check them with array_contains:

from pyspark.sql import Window
from pyspark.sql.functions import array_contains, collect_list

win = Window.partitionBy('search_term')
df.withColumn('exact_match_exist', array_contains(collect_list('keyword_match_type').over(win), 'EXACT')).show()

+-----------+------------------+-----------------+
|search_term|keyword_match_type|exact_match_exist|
+-----------+------------------+-----------------+
|        st1|             EXACT|             true|
|        st1|               DSA|             true|
|        st2|               DSA|            false|
|        st2|               DSA|            false|
|        st2|               DSA|            false|
+-----------+------------------+-----------------+

Upvotes: 1

blackbishop

Reputation: 32660

In Spark 3, you can use the any function:

from pyspark.sql import Window, functions as F

w = Window.partitionBy("search_term")

df1 = df.withColumn(
    "exact_match_exist",
    F.expr("any(keyword_match_type = 'EXACT')").over(w)
)

df1.show()
# +-----------+------------------+-----------------+
# |search_term|keyword_match_type|exact_match_exist|
# +-----------+------------------+-----------------+
# |        st1|             EXACT|             true|
# |        st1|               DSA|             true|
# |        st2|               DSA|            false|
# |        st2|               DSA|            false|
# |        st2|               DSA|            false|
# +-----------+------------------+-----------------+

Or use a conditional count like this:

df1 = df.withColumn(
    "exact_match_exist",
    F.count(F.when(F.col("keyword_match_type") == "EXACT", 1)).over(w) > 0
)
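
As a side note, a boolean max over the same window should give an equivalent result, since Spark orders True above False (a sketch of my own, not taken from the original answer):

# Hypothetical alternative: the max of a boolean comparison over the window
# becomes True as soon as any row in the group matches 'EXACT'.
df1 = df.withColumn(
    "exact_match_exist",
    F.max(F.col("keyword_match_type") == "EXACT").over(w)
)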

Upvotes: 2
