Reputation: 101
I have a DataFrame and want to add a column, check, that indicates whether the word "yes" appears in that row's text column (1 if it does, 0 if not). check should be 1 only when "yes" appears as a whole word, including when it sits next to a punctuation mark (for example, yes!), and not when it is merely a substring of another word. How can I do that in Spark? For example:
id  group  text
1   a      hey there
2   c      no you can
3   a      yes yes yes
4   b      yes or no
5   b      you need to say yes.
6   a      yes you can
7   d      yes!
8   c      no&
9   b      ok
The expected result is:
id  group  text                  check
1   a      hey there             0
2   c      no you can            0
3   a      yes yes yes           1
4   b      yes or no             1
5   b      you need to say yes.  1
6   a      yes you can           1
7   d      yes!                  1
8   c      no&                   0
9   b      ok                    0
Upvotes: 1
Views: 1315
Reputation: 222722
I need to put 1 in check only if "yes" appear as a word and not as a substring.
You could address this by matching text against a regex that uses word boundaries (\b). This is a handy regex feature: \b matches the position between a word character and a non-word character (a space, a punctuation mark, and so on), as well as the start or end of the string next to a word character.
In SQL, you would do:
-- the backslashes are doubled because Spark SQL unescapes string literals by default
select
    t.*,
    case when text rlike '\\byes\\b' then 1 else 0 end as check
from mytable t
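To see what the word-boundary pattern does with the rows from the question, here is a minimal sketch in plain Python. It uses Python's re module rather than Spark, so it only illustrates the regex itself. The helper name has_word_yes and the extra "eyes wide open" row are my own additions for illustration, not part of the question.

```python
import re

# Word-boundary pattern: "yes" must not be glued to other word characters.
YES_WORD = re.compile(r"\byes\b")


def has_word_yes(text: str) -> int:
    """Return 1 if "yes" occurs as a whole word in text, else 0."""
    return int(YES_WORD.search(text) is not None)


# Rows from the question, plus one hypothetical substring case ("eyes").
texts = [
    "hey there", "no you can", "yes yes yes", "yes or no",
    "you need to say yes.", "yes you can", "yes!", "no&", "ok",
    "eyes wide open",
]

checks = [has_word_yes(t) for t in texts]
for t, c in zip(texts, checks):
    print(f"{t!r}: {c}")
```

In PySpark the same pattern should be usable through the DataFrame API as F.col("text").rlike(r"\byes\b").cast("int"). Spark evaluates rlike with Java regular expressions, and for plain ASCII text like this \b behaves the same way there. Note that the match is case-sensitive, so "Yes" would not count.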
Upvotes: 2
Reputation: 75150
You can check with rlike and cast to Integer:
import pyspark.sql.functions as F
df.withColumn("check",F.col("text").rlike("yes").cast("Integer")).show()
+---+-----+--------------------+-----+
| id|group|                text|check|
+---+-----+--------------------+-----+
|  1|    a|           hey there|    0|
|  2|    c|          no you can|    0|
|  3|    a|         yes yes yes|    1|
|  4|    b|           yes or no|    1|
|  5|    b|you need to say yes.|    1|
|  6|    a|         yes you can|    1|
|  7|    d|                yes!|    1|
|  8|    c|                 no&|    0|
|  9|    b|                  ok|    0|
+---+-----+--------------------+-----+
For the edited question (match "yes" only as a whole word), you can try higher-order functions:
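import re
import string

import pyspark.sql.functions as F

# Regex alternation that matches any single punctuation character
pat = '|'.join(re.escape(ch) for ch in string.punctuation)

(df.withColumn("text1", F.regexp_replace(F.col("text"), pat, ""))  # strip punctuation
   .withColumn("Split", F.split("text1", " "))                     # split into words
   # 1 if any word is exactly "yes", else 0
   .withColumn("check", F.expr('exists(Split, x -> x = "yes")').cast("Integer"))
   .drop("Split", "text1")
   .show())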
+---+-----+--------------------+-----+
| id|group|                text|check|
+---+-----+--------------------+-----+
|  1|    a|           hey there|    0|
|  2|    c|          no you can|    0|
|  3|    a|         yes yes yes|    1|
|  4|    b|           yes or no|    1|
|  5|    b|you need to say yes.|    1|
|  6|    a|         yes you can|    1|
|  7|    d|                yes!|    1|
|  8|    c|                 no&|    0|
|  9|    b|               okyes|    0|
+---+-----+--------------------+-----+
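If you want to check the strip-then-split logic without a Spark session, the same steps can be mirrored in plain Python. This is only a sketch of the idea, not the Spark code itself: check_by_split is a hypothetical helper name, and the "yes-man" row is an extra case I added to show one edge of this approach.

```python
import re
import string

# Same alternation of escaped punctuation characters as `pat` above.
pat = "|".join(re.escape(ch) for ch in string.punctuation)


def check_by_split(text: str) -> int:
    """Strip punctuation, split on single spaces, look for the exact token "yes"."""
    stripped = re.sub(pat, "", text)
    return int("yes" in stripped.split(" "))


# A few rows from the output above, plus a hypothetical hyphenated case.
texts = ["yes!", "you need to say yes.", "no&", "okyes", "yes-man"]
results = {t: check_by_split(t) for t in texts}
print(results)
```

Because punctuation is deleted before splitting, "yes-man" collapses to "yesman" and gets 0 here, while a \byes\b regex would treat the hyphen as a word boundary and give 1. Which behaviour you want depends on your data.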
Upvotes: 3