How to regexp_extract if a matching pattern resides anywhere in the string - pyspark

I was trying to get some insights on regexp_extract in pyspark and I tried to do a check with this option to get better understanding.

Below is my dataframe

data = [('2345', 'Checked|by John|for kamal'),
('2398', 'Checked|by John|for kamal '),
('2328', 'Verified|by Srinivas|for kamal than some random text'),        
('3983', 'Verified|for Stacy|by John')]

df = sc.parallelize(data).toDF(['ID', 'Notes'])

df.show()

+----+-----------------------------------------------------+
|  ID|               Notes                                 |
+----+-----------------------------------------------------+
|2345|Checked|by John|for kamal                            |
|2398|Checked|by John|for kamal                            |
|2328|Verified|by Srinivas|for kamal than some random text |
|3983|Verified|for Stacy|by John                           |
+----+-----------------------------------------------------+

So here I was trying to identify whether an ID is checked or verified by John

With the help of SO members I was able to crack the use of regexp_extract and came to below solution

result = df.withColumn('Employee', regexp_extract(col('Notes'), '(Checked|Verified)(\\|by John)', 1))

result.show()

+----+------------------------------------------------+------------+
|  ID|               Notes                                |Employee|
+----+------------------------------------------------+------------+
|2345|Checked|by John|for kamal                           | Checked|
|2398|Checked|by John|for kamal                           | Checked|
|2328|Verified|by Srinivas|for kamal than some random text|        |
|3983|Verified|for Stacy|by John                          |        |
+----+--------------------+----------------------------------------+

For few ID's this gives me perfect result ,But for last ID it didn't print Verified. Could someone please let me know whether any other action needs to be performed in the mentioned regular expression?

What I feel is (Checked|Verified)(\\|by John) is matching only adjacent values. I tried * and $, still it didn't print Verified for ID 3983.

Upvotes: 1

Answers (3)

wwnde

Reputation: 26676

Another way is to check if the column Notes contains a string by John:

   df.withColumn('Employee',F.when(col('Notes').like('%Checked|by John%'), 'Checked').when(col('Notes').like('%by John'), 'Verified').otherwise(" ")).show(truncate=False)

+----+----------------------------------------------------+--------+
|ID  |Notes                                               |Employee|
+----+----------------------------------------------------+--------+
|2345|Checked|by John|for kamal                           |Checked |
|2398|Checked|by John|for kamal                           |Checked |
|2328|Verified|by Srinivas|for kamal than some random text|        |
|3983|Verified|for Stacy|by John                          |Verified|
+----+----------------------------------------------------+--------+

Upvotes: 1

mck

Reputation: 42352

You can try this regex:

import pyspark.sql.functions as F

result = df.withColumn('Employee', F.regexp_extract('Notes', '(Checked|Verified)\\|.*by John', 1))

result.show()
+----+--------------------+--------+
|  ID|               Notes|Employee|
+----+--------------------+--------+
|2345|Checked|by John|f...| Checked|
|2398|Checked|by John|f...| Checked|
|2328|Verified|by Srini...|        |
|3983|Verified|for Stac...|Verified|
+----+--------------------+--------+

Upvotes: 1

Tim Biegeleisen

Reputation: 521093

I would have phrased the regex as:

(Checked|Verified)\b.*\bby John

Demo

This pattern finds Checked/Verified followed by by John with the two separated by any amount of text. Note that I just use word boundaries here instead of pipes.

Updated code:

result = df.withColumn('Employee', regexp_extract(col('Notes'), '\b(Checked|Verified)\b.*\bby John', 1))

Upvotes: 1

How to regexp_extract if a matching pattern resides anywhere in the string - pyspark

Answers (3)

Demo

Related Questions