RandomTask

Reputation: 509

Query a pandas dataframe column for a text phrase that may or may not have words within that phrase

Goal: to query a pandas dataframe column for a text phrase that may or may not have other words within it. At a high level, a phrase is "word1 word2". Between word1 and word2 there may or may not be other words.

This sounds like a duplicate, but I tried the Stack Overflow answers here:

How to extract a substring from inside a string in Python?

Regular expression: matching and grouping a variable number of space separated words

Match text between two strings with regular expression

Extract text information between two define text

I also tried a few others, and they all miss the case where there are no words between word1 and word2.

These highly voted solutions all rely on (.+?) between word1 and word2.

Ex: word1(.+?)word2

The above works well if there are one or more words between word1 and word2. However, if there are no words between them, it doesn't return any results. I would like it to return results in that case as well, because the text phrase still contains word1 word2.

Also, the data will be cleaned in advance, so there is no need to consider capitalization, commas or other spurious characters.

My code and trials are below. In place of word1 word2 I'm using "pieces delivered" as the text phrase.

Note that they all miss the first example, where there are no intervening words between "pieces" and "delivered". The query should return "some pieces delivered on time" along with the other rows containing "pieces ... delivered".

Thanks in advance.

import pandas as pd
df = pd.Series(['a', 'b', 'c', 'some pieces delivered on time', 'all pieces not delivered', 'most pieces were never delivered at all', 'the pieces will never ever be delivered', 'some delivered', 'i received broken pieces'])

print("Baseline - Desired results SHOULD contain:\n", df.iloc[3:7])

# The following options all miss one or more rows from the desired results. 
# Just uncomment rgx = to run a regex. 
rgx = r'pieces\s(.*?)\sdelivered'
#rgx = r'pieces\s(\w*)\sdelivered'
#rgx = r'pieces\s(\w*)+\sdelivered'
#rgx = r'pieces\s(\w)*\sdelivered'
#rgx = r'pieces\s(\w+\s)+\sdelivered'
#rgx = r'pieces\s(.*)\sdelivered'
#rgx = r'pieces\s+((%s).*?)\sdelivered'

df2 = df[df.str.contains(rgx)]
print("\nActual results were:\n", df2)

Upvotes: 0

Views: 676

Answers (1)

DYZ

Reputation: 57085

The second '\s' is in the wrong position. It is needed only when the two words are not adjacent, so it belongs inside an optional group together with the intervening words:

df[df.str.contains(r'pieces\s(?:.+?\s)?delivered')]
#3              some pieces delivered on time
#4                   all pieces not delivered
#5    most pieces were never delivered at all
#6    the pieces will never ever be delivered
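
To see why the placement of '\s' matters, here is a minimal sketch that compares the first pattern from the question with the one above on the adjacent-words row. The variable names (`phrases`, `question_rgx`, `answer_rgx`, `adjacent`) are made up for illustration; the data is the same Series as in the question. The question's pattern needs whitespace on both sides of the (possibly empty) group, and "pieces delivered" has only one space, so it cannot match.

```python
import re

import pandas as pd

# Same sample data as in the question (the name `phrases` is illustrative).
phrases = pd.Series([
    'a', 'b', 'c',
    'some pieces delivered on time',
    'all pieces not delivered',
    'most pieces were never delivered at all',
    'the pieces will never ever be delivered',
    'some delivered',
    'i received broken pieces',
])

# First pattern from the question: whitespace is required on BOTH sides
# of the group, so at least two whitespace characters must appear.
question_rgx = r'pieces\s(.*?)\sdelivered'

# Pattern from this answer: the intervening words and their trailing
# whitespace form one optional, non-capturing group.
answer_rgx = r'pieces\s(?:.+?\s)?delivered'

adjacent = 'some pieces delivered on time'
hits_question = bool(re.search(question_rgx, adjacent))
hits_answer = bool(re.search(answer_rgx, adjacent))
print(hits_question, hits_answer)

# Filter the Series with the corrected pattern.
matched = phrases[phrases.str.contains(answer_rgx)]
print(matched)
```

Using a non-capturing group `(?:...)` also keeps `str.contains` from warning about match groups, which the capturing `(...)` patterns in the question would trigger.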

Upvotes: 1
