Ricky Rick

Reputation: 61

Extract exact words or set of characters using Regex in Python

Suppose I have a list like this.

List = ['MX_QW-765', 'RUC_PO-345', 'RUC_POLO-209']

I want to search the list and return only the items that contain 'PO' as a complete token. I expect RUC_PO-345 as the only output, but RUC_POLO-209 is returned along with it.

Upvotes: 3

Views: 4300

Answers (6)

Omnifarious

Reputation: 56048

Use a regular expression (import re) with this pattern: r'(?<![A-Za-z0-9])PO(?![A-Za-z0-9])'.

I previously recommended the \b special sequence, but \b treats '_' as a word character. In your data the underscore acts as a separator, so \b would not find a boundary between '_' and 'PO' and would not work.

That leaves the somewhat more complex negative lookbehind and negative lookahead assertions, which is what (?<!...) and (?!...) are, respectively. To understand how they work, read the documentation for Python regular expressions.
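As a rough sketch of the difference, the snippet below runs both the \b version and the lookaround version over the list from the question. The variable names are only illustrative.

```python
import re

items = ['MX_QW-765', 'RUC_PO-345', 'RUC_POLO-209']

# \b treats '_' as a word character, so there is no boundary between '_' and 'P'.
word_boundary = re.compile(r'\bPO\b')

# Lookarounds: PO must not be directly preceded or followed by a letter or digit.
lookaround = re.compile(r'(?<![A-Za-z0-9])PO(?![A-Za-z0-9])')

boundary_hits = [s for s in items if word_boundary.search(s)]
lookaround_hits = [s for s in items if lookaround.search(s)]

print(boundary_hits)    # []
print(lookaround_hits)  # ['RUC_PO-345']
```

Because the lookarounds consume no characters, the pattern also matches when PO sits at the very start or end of a string.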

Upvotes: 0

jameshollisandrew

Reputation: 1331

The pattern:

'_PO[^\w]'

should work with a re.search() or re.findall() call. It will not work with re.match(), because re.match() only matches at the beginning of the string and these strings do not start with '_PO'.

The pattern reads: match one underscore ('_'), followed by one capital P ('P'), followed by one capital O ('O'), followed by one character that is not a word character. For ASCII text, the special sequence '\w' matches [a-zA-Z0-9_].

'_PO\W'

^ This can be used as a shorter equivalent of the first pattern (credit @JvdV in comments).

'_PO[^A-Za-z]'

This pattern uses a negated character set that excludes only letters, in case the dash interferes with either of the first two patterns.
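To see where the three patterns differ, here is a small sketch that runs each one over the original strings plus two made-up ones ('RUC_PO9-1' and 'RUC_PO' are hypothetical and not from the question).

```python
import re

# The last two entries are hypothetical, added only to show the edge cases.
samples = ['RUC_PO-345', 'RUC_POLO-209', 'RUC_PO9-1', 'RUC_PO']

patterns = [r'_PO[^\w]', r'_PO\W', r'_PO[^A-Za-z]']

# Map each pattern to the samples it finds a match in.
results = {p: [s for s in samples if re.search(p, s)] for p in patterns}

for pattern, hits in results.items():
    print(pattern, hits)
# _PO[^\w] ['RUC_PO-345']
# _PO\W ['RUC_PO-345']
# _PO[^A-Za-z] ['RUC_PO-345', 'RUC_PO9-1']
```

Note that all three patterns require one more character after 'PO', so a string that ends in '_PO' is not matched by any of them.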

To use this to identify the pattern in a list, you can use a loop:

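import re

my_list = ['MX_QW-765', 'RUC_PO-345', 'RUC_POLO-209']

for thing in my_list:
    # re.search scans the whole string and returns None when nothing matches
    if re.search(r'_PO[^\w]', thing) is not None:
        # do something with the matching item
        print(thing)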

This uses the re.search call as the condition of the if statement. When re.search finds no match it returns None, hence the test if re.search(...) is not None.

Hope it helps!

Upvotes: 1

JvdV

Reputation: 75850

Before updated question:

As per my comment, I think you are using the wrong approach. To me it seems you can simply use in:

words = ['cat', 'caterpillar', 'monkey', 'monk', 'doggy', 'doggo', 'dog']
if 'cat' in words:
    print("yes")
else:
    print("no")

Returns: yes

words = ['cats', 'caterpillar', 'monkey', 'monk', 'doggy', 'doggo', 'dog']
if 'cat' in words:
    print("yes")
else:
    print("no")

Returns: no


After updated question:

Now, if your sample data does not actually reflect your needs and you are interested in finding a substring within a list element, you could try:

import re
words = ['MX_QW-765', 'RUC_PO-345', 'RUC_POLO-209']
srch = 'PO'
r = re.compile(fr'(?<=_){srch}(?=-)')
print(list(filter(r.findall, words)))

Or using match:

import re
words = ['MX_QW-765', 'RUC_PO-345', 'RUC_POLO-209']
srch = 'PO'
r = re.compile(fr'^.*(?<=_){srch}(?=-).*$')
print(list(filter(r.match, words)))

This will return a list of the items that follow the pattern (in this case just ['RUC_PO-345']). I used the pattern above to make sure your search value is not at the start of the string, but comes directly after an underscore and is directly followed by a -.


Now if you have a list of products you want to find, consider the below:

import re
words = ['MX_QW-765', 'RUC_PO-345', 'RUC_POLO-209']
srch = ['PO', 'QW']
r = re.compile(fr'(?<=_)({"|".join(srch)})(?=-)')
print(list(filter(r.findall, words)))

Or again using match:

import re
words = ['MX_QW-765', 'RUC_PO-345', 'RUC_POLO-209']
srch = ['PO', 'QW']
r = re.compile(fr'^.*(?<=_)({"|".join(srch)})(?=-).*$')
print(list(filter(r.match, words)))

Both would return: ['MX_QW-765', 'RUC_PO-345']

Note that if your Python version does not support f-strings, you can concatenate your variable into the pattern instead.
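A minimal sketch of the concatenation variant could look like this. The re.escape call is my addition rather than part of the patterns above; it only matters if the search value can contain regex metacharacters.

```python
import re

words = ['MX_QW-765', 'RUC_PO-345', 'RUC_POLO-209']
srch = 'PO'

# Build the same lookbehind/lookahead pattern with plain string concatenation.
# re.escape is an extra safeguard for search values such as 'P.O' or 'P+'.
pattern = r'(?<=_)' + re.escape(srch) + r'(?=-)'
r = re.compile(pattern)

matches = [w for w in words if r.search(w)]
print(matches)  # ['RUC_PO-345']
```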

Upvotes: 5

Udaya Prakash

Reputation: 541

We can try matching one of the three exact words 'cat','dog','monk' in our regex string.

Our regex string is going to be "\b(?:cat|dog|monk)\b"

\b marks a word boundary. We use \b so that only whole words are matched (this is the exact problem you were facing). With it, the pattern matches cat but not tomcat or caterpillar.

Next, (?:) is called a non-capturing group (explained here).

Now we need to match any one of cat, dog, or monk. This is expressed as cat|dog|monk. In Python 3 this would be:

import re

words = ['cat', 'caterpillar', 'monkey', 'monk', 'doggy', 'doggo', 'dog']
regex = r"\b(?:cat|dog|monk)\b"

r = re.compile(regex)
matched = list(filter(r.match, words))

print(matched)

To apply the regex to every item of an iterable such as a list, we use the filter function, as mentioned in a Stack Overflow answer here
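One thing worth checking is that r.match only tries the pattern at the start of each string, while r.search scans the whole string. The sketch below compares the two on a few hypothetical inputs (the phrases list is made up for illustration).

```python
import re

r = re.compile(r"\b(?:cat|dog|monk)\b")

# Hypothetical inputs: single words plus one short phrase.
phrases = ['cat', 'tomcat', 'caterpillar', 'my dog', 'monk']

# match: the whole word must begin at position 0 of the string.
with_match = [p for p in phrases if r.match(p)]
# search: the whole word may appear anywhere in the string.
with_search = [p for p in phrases if r.search(p)]

print(with_match)   # ['cat', 'monk']
print(with_search)  # ['cat', 'my dog', 'monk']
```

For a list of single words both calls give the same result, so the choice only matters once the items can contain more than one word.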

You can find the runnable Python code here

NOTE: Finally, regex101 is a great online tool to try out different regex strings and get their explanation in real-time. The explanation for our regex string is here

Upvotes: 0

user12867493

Reputation:

You need to add a $ sign, which signifies the end of the string. You can also add a ^, which signifies the start of the string, so that only cat matches:

 ^cat$
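As a quick sketch of how the anchored pattern behaves on a word list (the list is hypothetical), with re.fullmatch shown as an alternative that needs no explicit anchors:

```python
import re

# Hypothetical word list for illustration.
words = ['cat', 'caterpillar', 'tomcat', 'cats']

# ^ and $ pin the pattern to the start and end of the string.
anchored = [w for w in words if re.search(r'^cat$', w)]

# re.fullmatch requires the entire string to match the pattern.
full = [w for w in words if re.fullmatch(r'cat', w)]

print(anchored)  # ['cat']
print(full)      # ['cat']
```

This approach only fits when each list item is exactly the word being searched for. It would not find 'PO' inside 'RUC_PO-345'.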

Upvotes: 0

Tim Biegeleisen

Reputation: 521239

Try building a regex alternation using the search terms in the list:

import re

words = ['cat', 'caterpillar', 'monkey', 'monk', 'doggy', 'doggo', 'dog']
your_text = 'I like cat, dog, rabbit, antelope, and monkey, but not giraffes'
# Join the search terms into one alternation wrapped in word boundaries
regex = r'\b(?:' + '|'.join(words) + r')\b'
print(regex)
matches = re.findall(regex, your_text)
print(matches)

This prints:

\b(?:cat|caterpillar|monkey|monk|doggy|doggo|dog)\b
['cat', 'dog', 'monkey']

You can clearly see the regex alternation which we built to find all matching keywords.
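If the search terms might contain regex metacharacters, it is probably safer to escape each term before joining. The sketch below uses hypothetical terms and text to show the difference; re.escape is my addition to the approach above.

```python
import re

# Hypothetical search terms; 'v1.0' contains a '.', which is a regex wildcard.
terms = ['v1.0', 'cat']
text = 'v1.0 and v1x0 are different; the cat agrees'

unescaped = r'\b(?:' + '|'.join(terms) + r')\b'
escaped = r'\b(?:' + '|'.join(re.escape(t) for t in terms) + r')\b'

unescaped_hits = re.findall(unescaped, text)
escaped_hits = re.findall(escaped, text)

print(unescaped_hits)  # ['v1.0', 'v1x0', 'cat']
print(escaped_hits)    # ['v1.0', 'cat']
```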

Upvotes: 1
