Reputation: 1825
I asked a similar question before, but for some reason I sadly have to reimplement it in PySpark.
For example,
app col1
app1 anybody love me?
app2 I hate u
app3 this hat is good
app4 I don't like this one
app5 oh my god
app6 damn you.
app7 such nice girl
app8 xxxxx
app9 pretty prefect
app10 don't love me.
app11 xxx anybody?
I want to match a keyword list like ["anybody", "love", "you", "xxx", "don't"]
and select the matched keyword result as a new column, named keyword as follows:
app keyword
app1 [anybody, love]
app4 [don't]
app6 [you]
app8 [xxx]
app10 [don't, love]
app11 [xxx]
As in the accepted answer there, the suitable approach is to create a temporary dataframe converted from the string list, inner join the two dataframes together, and select the rows of app and keyword that match the condition.
-- Hiveql implementation
select t.app, k.keyword
from mytable t
inner join (values ('anybody'), ('you'), ('xxx'), ('don''t')) as k(keyword)
on t.col1 like concat('%', k.keyword, '%')
But I am not familiar with PySpark and find it awkward to reimplement.
Could anyone help me?
Thanks in advance.
Upvotes: 2
Views: 3538
Reputation: 7336
Please find below two possible approaches:
Option 1
The first option is to use the dataframe API to implement a join analogous to the one in your previous question. Here we convert the keywords
list into a dataframe and then join it with the large dataframe (note that we broadcast the small dataframe to ensure better performance):
from pyspark.sql.functions import broadcast
df = spark.createDataFrame([
["app1", "anybody love me?"],
["app4", "I don't like this one"],
["app5", "oh my god"],
["app6", "damn you."],
["app7", "such nice girl"],
["app8", "xxxxx"],
["app10", "don't love me."]
]).toDF("app", "col1")
keywords = ["xxx", "don't"]
# create keywords dataframe
kdf = spark.createDataFrame([(k,) for k in keywords], "key string")
# +-----+
# | key|
# +-----+
# | xxx|
# |don't|
# +-----+
df.join(broadcast(kdf), df["col1"].contains(kdf["key"]), "inner").show(10, False)
# +-----+---------------------+-----+
# |app |col1 |key |
# +-----+---------------------+-----+
# |app4 |I don't like this one|don't|
# |app8 |xxxxx |xxx |
# |app10|don't love me. |don't|
# +-----+---------------------+-----+
The join condition is based on the contains function of the Column class.
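If you need the matched keywords collected into a single array per app, as in the desired output above, here is a minimal sketch building on the same df and kdf (grouping the join result with collect_list; row order is not guaranteed):

from pyspark.sql.functions import broadcast, collect_list

# group the matched rows per app and collect the matched keys into an array
df.join(broadcast(kdf), df["col1"].contains(kdf["key"]), "inner") \
  .groupBy("app") \
  .agg(collect_list("key").alias("keyword")) \
  .show(10, False)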
Option 2
You can also use the PySpark higher-order function filter in combination with rlike inside an expr:
from pyspark.sql.functions import lit, expr, array
df = spark.createDataFrame([
["app1", "anybody love me?"],
["app4", "I don't like this one"],
["app5", "oh my god"],
["app6", "damn you."],
["app7", "such nice girl"],
["app8", "xxxxx"],
["app10", "don't love me."]
]).toDF("app", "col1")
keywords = ["xxx", "don't"]
df.withColumn("keywords", array([lit(k) for k in keywords])) \
.withColumn("keywords", expr("filter(keywords, k -> col1 rlike k)")) \
.where("size(keywords) > 0") \
.show(10, False)
# +-----+---------------------+--------+
# |app |col1 |keywords|
# +-----+---------------------+--------+
# |app4 |I don't like this one|[don't] |
# |app8 |xxxxx |[xxx] |
# |app10|don't love me. |[don't] |
# +-----+---------------------+--------+
Explanation
With array([lit(k) for k in keywords]) we generate an array containing the keywords that our search will be based on, and we append it to the existing dataframe using withColumn.
Next, with expr("filter(keywords, k -> col1 rlike k)") we go through the items of keywords, keeping only those that are present in the col1 text. If at least one keyword matches, filter
returns one or more items and size(keywords)
is greater than 0, which constitutes our where
condition for retrieving the records.
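Note that rlike treats every keyword as a regular expression, so keywords containing regex metacharacters (e.g. ? or .) could match unintentionally. A minimal sketch of a literal-substring variant, assuming the same df and keywords as above and swapping rlike for the SQL instr function (the filter higher-order function itself requires Spark 2.4+):

from pyspark.sql.functions import lit, expr, array

# same approach as Option 2, but instr(col1, k) returns the 1-based
# position of the literal substring k in col1 (0 if absent), so the
# keywords are not interpreted as regular expressions
df.withColumn("keywords", array([lit(k) for k in keywords])) \
  .withColumn("keywords", expr("filter(keywords, k -> instr(col1, k) > 0)")) \
  .where("size(keywords) > 0") \
  .show(10, False)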
Upvotes: 1