Reputation: 22694
df (Pandas Dataframe) has three rows.
col_name
"This is Donald."
"His hands are so small"
"Why are his fingers so short?"
I'd like to extract the row that contains "is" and "small".
If I do
df.col_name.str.contains("is|small", case=False)
Then it catches "His" as well- which I don't want.
Is below query is the right way to catch the whole word in df.series?
df.col_name.str.contains("\bis\b|\bsmall\b", case=False)
Upvotes: 11
Views: 16609
Reputation: 474
In "\bis\b|\bsmall\b"
, the backslash \b
is parsed as ASCII Backspace before it is even passed to the regular expression method for matching/searching. For more information check this document about escape characters. It is mentioned in this document, that
When an ‘r’ or ‘R’ prefix is present, a character following a backslash is included in the string without change, and all backslashes are left in the string.
Therefore, there are two options -
r
prefixdf.col_name.str.contains(r"\bis\b|\bsmall\b", case=False)
\
character -df.col_name.str.contains("\\bis\\b|\\bsmall\\b", case=False)
Upvotes: 2
Reputation: 6173
No, the regex /bis/b|/bsmall/b
will fail because you are using /b
, not \b
which means "word boundary".
Change that and you get a match. I would recommend using
\b(is|small)\b
This regex is a little faster and a little more legible, at least to me. Remember to put it in a raw string (r"\b(is|small)\b"
) so you don’t have to escape the backslashes.
Upvotes: 13
Reputation: 23
In extension to the discussion, I want to use a variable inside the regex as follows:
df = df_w[df_w['Country/Region'].str.match("\b(location.loc[i]['country'])\b",case=False)]
If I do not put \b\b, the code returns all the columns with both Sudan and South Sudan. While, when I use "\b(location.loc[i]['country'])\b", it returns empty dataframes. Kindly tell me the correct usage.
Upvotes: 0
Reputation: 109656
First, you may want to convert everything to lowercase, remove punctuation and whitespace and then convert the result into a set of words.
import string
df['words'] = [set(words) for words in
df['col_name']
.str.lower()
.str.replace('[{0}]*'.format(string.punctuation), '')
.str.strip()
.str.split()
]
>>> df
col_name words
0 This is Donald. {this, is, donald}
1 His hands are so small {small, his, so, are, hands}
2 Why are his fingers so short? {short, fingers, his, so, are, why}
You can now use boolean indexing to see if all of your target words are in these new word sets.
target_words = ['is', 'small']
# Convert target words to lower case just to be safe.
target_words = [word.lower() for word in target_words]
df['match'] = df.words.apply(lambda words: all(target_word in words
for target_word in target_words))
print(df)
# Output:
# col_name words match
# 0 This is Donald. {this, is, donald} False
# 1 His hands are so small {small, his, so, are, hands} False
# 2 Why are his fingers so short? {short, fingers, his, so, are, why} False
target_words = ['so', 'small']
target_words = [word.lower() for word in target_words]
df['match'] = df.words.apply(lambda words: all(target_word in words
for target_word in target_words))
print(df)
# Output:
# Output:
# col_name words match
# 0 This is Donald. {this, is, donald} False
# 1 His hands are so small {small, his, so, are, hands} True
# 2 Why are his fingers so short? {short, fingers, his, so, are, why} False
To extract the matching rows:
>>> df.loc[df.match, 'col_name']
# Output:
# 1 His hands are so small
# Name: col_name, dtype: object
To make this all into a single statement using boolean indexing:
df.loc[[all(target_word in word_set for target_word in target_words)
for word_set in (set(words) for words in
df['col_name']
.str.lower()
.str.replace('[{0}]*'.format(string.punctuation), '')
.str.strip()
.str.split())], :]
Upvotes: 4
Reputation: 3351
Your way (with /b) didn't work for me. I'm not sure why you can't use the logical operator and (&) since I think that's what you actually want.
This is a silly way to do it, but it works:
mask = lambda x: ("is" in x) & ("small" in x)
series_name.apply(mask)
Upvotes: 0