shreder1921
shreder1921

Reputation: 101

find if a column(called text) has a word in it pyspark

i have a df and i need to count how many times each group has the word yes. i need to find the word (not as a substring) and if it has a punctuation mark next to it i need to count that as well for example:

id  group  text
1   a       hey there
2   c       no you can
3   a       yes yes yes
4   b       yes or no
5   b       you need to say yes.
6   a       yes you can
7   d       yes!
8   c       no&
9   b       ok

result:

group  count
a      2
b      2
c      0
d      1

i tried this:

sql_q = spark.sql("select group, count(*) as count from my_table where text LIKE ' yes' or text LIKE 'yes ' or text LIKE ' yes ' group by group")

Upvotes: 1

Views: 378

Answers (2)

Shubham Jain
Shubham Jain

Reputation: 5526

Query remains same in pyspark or scala. Pyspark syntax

sqlQ = spark.sql(
      """
        select group, sum(
         case when (text rlike '(?i)^.*yes.*$') then 1 else 0 end
        ) as count
        from my_table group by group
      """)
sqlQ.show(20,False)

Upvotes: 1

Som
Som

Reputation: 6323

Try this-

val sql_q = spark.sql(
      """
        |select group, sum(
        |  case when (text rlike '(?i)^.*yes.*$') then 1 else 0 end
        | ) as count
        |from my_table group by group
      """.stripMargin)
    sql_q.show(false)

    /**
      * +-----+-----+
      * |group|count|
      * +-----+-----+
      * |a    |2    |
      * |c    |0    |
      * |d    |1    |
      * |b    |2    |
      * +-----+-----+
      */

Upvotes: 1

Related Questions