Reputation: 809
I'm trying to find out how many times a variant of a German job name appears in a given string. Suppose the job name is Schneider (Tailor). The variants (denoting male and female forms of the job name), including the job name itself, are:
Schneider
Schneiderin
Schneider/in
Schneider/-in
Schneider (m/w)
So suppose I have the following string:
Schneider Schneiderin Schneider/in Schneider/-in Schneider (m/w)
Each variant should be counted individually, disregarding any overlapping between the variants. So if I go through each variant and count the number of occurrences in the above string, the result should always be 1.
I tried to solve this with a regex using word boundaries. I used the following pattern:
\b{}\b(?![\/]|(\s\(m\/w\)))
where {} will be replaced by the variant.
As you can see, the regex uses word boundaries to make sure only full-word matches are found. Additionally, it uses a negative lookahead so that a match is rejected when it is followed by a forward slash or by (m/w), instead of those being treated as word boundaries.
The pattern works well except for the last variant, Schneider (m/w), which is not found in the string. You can see this in action here: https://regex101.com/r/FTqvIO/4
For the sake of completeness here's my current implementation in Python:
import re

def count_variant(variant, string):
    pattern = re.compile(r'\b%s\b(?![\/]|(\s\(m\/w\)))' % variant)
    matches = re.findall(pattern, string)
    return len(matches)
Any help on the regex (or an easier approach if available) is greatly appreciated!
Edit: Inserted the correct link to Regex101
Upvotes: 0
Views: 216
Reputation: 626851
You may use unambiguous word boundaries:
r'(?<!\w){}(?![\w/]|\s\(m/w\))'.format(re.escape(word))
See the regex demo
The (?<!\w) will fail the match if there is a word char in front of the search term, and (?![\w/]|\s\(m/w\)) will fail the match if the search word is followed by a word char, a slash, or a whitespace plus (m/w).
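For illustration, here is a minimal sketch of how the pattern could be plugged into the count_variant function from the question. It assumes Python 3 and that each variant should be matched as literal text; VARIANTS, TEXT and counts are names made up for this example, filled with the sample data from the question.

```python
import re

def count_variant(variant, string):
    # re.escape turns the parentheses in "Schneider (m/w)" into literal characters
    pattern = re.compile(
        r'(?<!\w){}(?![\w/]|\s\(m/w\))'.format(re.escape(variant))
    )
    # No capturing groups, so findall returns one entry per match
    return len(pattern.findall(string))

# Illustrative names; the data is the sample from the question
VARIANTS = [
    "Schneider",
    "Schneiderin",
    "Schneider/in",
    "Schneider/-in",
    "Schneider (m/w)",
]
TEXT = "Schneider Schneiderin Schneider/in Schneider/-in Schneider (m/w)"

counts = {variant: count_variant(variant, TEXT) for variant in VARIANTS}
for variant, n in counts.items():
    print(variant, n)
```

The trailing \b in the original pattern cannot match after the closing parenthesis of (m/w), because a word boundary needs a word char on one side of it. The lookarounds only check what must not be there, so they also work when the search term ends in a non-word char.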
Upvotes: 1