Bowen Peng

Reputation: 1825

PySpark: select a column based on the condition that another column's values match some specific values, then create the match result as a new column

I asked a similar question before, but for some reasons I sadly have to reimplement it in PySpark.
For example,

app      col1

app1     anybody love me?
app2     I hate u
app3     this hat is good
app4     I don't like this one
app5     oh my god
app6     damn you.
app7     such nice girl
app8     xxxxx
app9     pretty prefect
app10    don't love me.
app11    xxx anybody?

I want to match a keyword list like ['anybody', 'love', 'you', 'xxx', "don't"] and collect the matched keywords as a new column, named keyword, as follows:

app      keyword

app1     [anybody, love]
app4     [don't]
app6     [you]
app8     [xxx]
app10    [don't, love]
app11    [xxx]

As in the accepted answer there, a suitable way to do it is to create a temporary dataframe from the string list, inner join the two dataframes,
and select the rows of app and keyword that satisfy the match condition.

-- Hiveql implementation
select t.app, k.keyword
from  mytable t
inner join (values ('anybody'), ('you'), ('xxx'), ('don''t')) as k(keyword)
    on t.col1 like concat('%', k.keyword, '%')


But I am not familiar with PySpark and it is awkward for me to reimplement this.
Could anyone help me?
Thanks in advance.

Upvotes: 2

Views: 3538

Answers (1)

abiratsis

Reputation: 7336

Please find below two possible approaches:

Option 1

The first option is to use the dataframe API to implement a join analogous to the one in your previous question. Here we convert the keywords list into a dataframe and then join it with the large dataframe (notice that we broadcast the small dataframe to ensure better performance):

from pyspark.sql.functions import broadcast

df = spark.createDataFrame([
  ["app1", "anybody love me?"],
  ["app4", "I don't like this one"],
  ["app5", "oh my god"],
  ["app6", "damn you."],
  ["app7", "such nice girl"],
  ["app8", "xxxxx"],
  ["app10", "don't love me."]
]).toDF("app", "col1")

# the keyword list used for matching
keywords = ["xxx", "don't"]

# create keywords dataframe
kdf = spark.createDataFrame([(k,) for k in keywords], "key string")

# +-----+
# |  key|
# +-----+
# |  xxx|
# |don't|
# +-----+

df.join(broadcast(kdf), df["col1"].contains(kdf["key"]), "inner").show(10, False)

# +-----+---------------------+-----+
# |app  |col1                 |key  |
# +-----+---------------------+-----+
# |app4 |I don't like this one|don't|
# |app8 |xxxxx                |xxx  |
# |app10|don't love me.       |don't|
# +-----+---------------------+-----+

The join condition is based on the contains function of the Column class.

Option 2

You can also use the PySpark higher-order function filter (available since Spark 2.4) in combination with rlike within an expr:

from pyspark.sql.functions import lit, expr, array

df = spark.createDataFrame([
  ["app1", "anybody love me?"],
  ["app4", "I don't like this one"],
  ["app5", "oh my god"],
  ["app6", "damn you."],
  ["app7", "such nice girl"],
  ["app8", "xxxxx"],
  ["app10", "don't love me."]
]).toDF("app", "col1")

keywords = ["xxx", "don't"]

df.withColumn("keywords", array([lit(k) for k in keywords])) \
  .withColumn("keywords", expr("filter(keywords, k -> col1 rlike k)")) \
  .where("size(keywords) > 0") \
  .show(10, False)

# +-----+---------------------+--------+
# |app  |col1                 |keywords|
# +-----+---------------------+--------+
# |app4 |I don't like this one|[don't] |
# |app8 |xxxxx                |[xxx]   |
# |app10|don't love me.       |[don't] |
# +-----+---------------------+--------+

Explanation

  1. with array([lit(k) for k in keywords]) we generate an array column containing the keywords that our search is based on, and we append it to the existing dataframe using withColumn.

  2. next, with expr("filter(keywords, k -> col1 rlike k)") we go through the items of keywords, keeping only those that are present in the col1 text. If at least one keyword matches, filter returns a non-empty array, so size(keywords) > 0 holds, which constitutes our where condition for retrieving the records.
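One caveat worth checking against your data: rlike treats each keyword as a regular expression, so keywords containing metacharacters such as ., ?, or + could match more than intended (or fail to compile at all). Escaping the keywords first, e.g. with Python's re.escape before building the array, keeps the match literal. A minimal sketch in plain Python (the "c++" keyword is a hypothetical example, not from the question):

```python
import re

# hypothetical keyword list containing a regex metacharacter
keywords = ["xxx", "don't", "c++"]

# "c++" is not a valid regex pattern as-is; escaping turns each
# keyword into a pattern that matches it literally
escaped = [re.escape(k) for k in keywords]

print(escaped)
```

The escaped strings can then be passed through lit into the array column exactly as in Option 2.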

Upvotes: 1
