Reputation: 747

Add exception to complicated regex

I have a fairly complex regular expression.

But I have a problem with it: the # and ++ characters are removed when more text follows them directly (as in C#/.Net and C++/.Net).

Question: How can I add an exception to the current regex for the C++ and C# tokens?

I used the following regex:

import re

text = 'Must-have skills: -.Net programming experience; -2 years experience in C++; C#/.Net, C++/.Net, C./.Net.'
text = re.sub(r'[!,.:;—](?= |$)', ' ', text)
print(re.sub(r'(?i)(?:(?!\.net\b|\b-\b)[^\w\s])+(?=[^\w\s]*\b)', ' ', text))

And I got the following result:

'Must-have skills   .Net programming experience   2 years experience in C++  C .Net  C .Net  C .Net '

Desired result:

'Must-have skills   .Net programming experience   2 years experience in C++  C# .Net  C++ .Net  C .Net '

Current regex details

(?i) - case insensitive mode on
(?:(?!\.net\b|\b-\b)[^\w\s])+ - any punctuation char ([^\w\s]), 1 or more occurrences, as many as possible, that does not start any of the sequences:
- \.net\b - .net as whole word
- | - or
- \b-\b - a hyphen enclosed with word chars
(?=[^\w\s]*\b) - a positive lookahead that requires 0+ punctuation chars followed by a word boundary position immediately to the right of the current location.
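
The breakdown above can be sketched in re.VERBOSE form, which may make it easier to see where an exception would have to go. The name PUNCT_RUN is made up for this illustration; the pattern itself is the one from the question. It reproduces the problem: after the first substitution, the #/ in C#/.Net is a punctuation run followed by .Net, so the lookahead succeeds and the whole run is replaced, while the ++ in "C++ " is followed by a space and is left alone.

```python
import re

text = 'Must-have skills: -.Net programming experience; -2 years experience in C++; C#/.Net, C++/.Net, C./.Net.'
text = re.sub(r'[!,.:;—](?= |$)', ' ', text)

# Same pattern as in the question, spread out with re.VERBOSE.
# PUNCT_RUN is a hypothetical name used only in this sketch.
PUNCT_RUN = re.compile(r'''
    (?:
        (?! \.net\b | \b-\b )   # not at the start of .net, nor a hyphen between word chars
        [^\w\s]                 # one punctuation character
    )+                          # one or more, as many as possible
    (?= [^\w\s]* \b )           # optional punctuation, then a word boundary
''', re.IGNORECASE | re.VERBOSE)

compact = re.sub(r'(?i)(?:(?!\.net\b|\b-\b)[^\w\s])+(?=[^\w\s]*\b)', ' ', text)
verbose = PUNCT_RUN.sub(' ', text)

print(verbose == compact)   # both forms behave the same
print('C#' in verbose)      # False: the '#' is swallowed together with '/'
```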

Upvotes: 4

Answers (3)

FailSafe

Reputation: 482

Edit

Same as the version below but much shorter: here I define all the characters that may precede the matched ones in a single set.

>>> import re

>>> text = 'Must-have skills: -.Net programming experience; -2 years experience in C++; C#/.Net, C++/.Net, C./.Net.'

>>> re.sub(r'(?:(?<!\S)|(?<=[\s\+\.C#]))[\-!,.:;—/]|[\-!,.:;—/](?=\s|$)', ' ', text)


#Output
'Must-have skills   .Net programming experience   2 years experience in C++  C# .Net  C++ .Net  C  .Net '

Explanation

The answer here is effectively the same as the one that follows below, but instead of declaring one by one the characters that must precede the matched set, this one defines them all in a single character class.
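
A quick way to check that claim is to run both patterns from this answer on the sample text and compare the results. This is only a sanity check on one input, not a proof; the variable names are made up for the sketch.

```python
import re

text = 'Must-have skills: -.Net programming experience; -2 years experience in C++; C#/.Net, C++/.Net, C./.Net.'

# Long form: one lookbehind per allowed preceding character.
long_form = r'(?:(?<!\S)|(?<=\s)|(?<=\+)|(?<=\.)|(?<=C)|(?<=#))[\-!,.:;—/]|[\-!,.:;—/](?=\s|$)'
# Short form: the same preceding characters gathered in one character class.
short_form = r'(?:(?<!\S)|(?<=[\s\+\.C#]))[\-!,.:;—/]|[\-!,.:;—/](?=\s|$)'

long_result = re.sub(long_form, ' ', text)
short_result = re.sub(short_form, ' ', text)

print(long_result == short_result)  # True
```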

This is kind of a dirty solution, but I will post an explanation later and might even refine it for better readability.

>>> import re

>>> text = 'Must-have skills: -.Net programming experience; -2 years experience in C++; C#/.Net, C++/.Net, C./.Net.'

>>> re.sub(r'(?:(?<!\S)|(?<=\s)|(?<=\+)|(?<=\.)|(?<=C)|(?<=#))[\-!,.:;—/]|[\-!,.:;—/](?=\s|$)', ' ', text)


#Output
'Must-have skills   .Net programming experience   2 years experience in C++  C# .Net  C++ .Net  C  .Net '

Edit: Explanation

By opening with (?: I start a non-capturing group. It holds the alternatives that describe what may come immediately before the character I want to match.
The key here is that each lookbehind, which starts with (?<! or (?<=, checks a single condition, so I open with (?: and then list several (?<! and (?<= assertions separated by | to say that the matched character should NOT be preceded by this, OR should be preceded by that, and so on.
So having opened with (?: I am now able to list what the matched character should or should not be preceded by.
Starting with (?<!\S): for this input it is not strictly needed, because the next alternative covers the whitespace case, but I included it as a safety net. It says a character from [\-!,.:;—/] should be matched if it is NOT preceded by a non-whitespace character, that is, if it sits at the start of the string or right after whitespace.
With |(?<=\s) I am saying OR a character from [\-!,.:;—/] should be matched if it is preceded by a single whitespace character.
With |(?<=\+)|(?<=\.)|(?<=C)|(?<=#) I'm saying OR a character from [\-!,.:;—/] should be matched if it is preceded by +, ., C, or #. So the . in [\-!,.:;—/] is matched when it is preceded by C, as in your string (remember (?<=C)); ; is matched when it is preceded by + (remember (?<=\+)); and / is matched when it is preceded by # (remember (?<=#)).
The final ) before the | closes the (?: group.
| as you know is OR, and because I can't fold everything into one statement, I have to repeat [\-!,.:;—/] and then add a lookahead that says: match a character from [\-!,.:;—/] if it is followed by whitespace or the end of the string. A lookahead can hold alternatives of different widths, which is why (?=\s|$) works; a lookbehind in Python's re module must have a fixed width, so the same trick is not available there.
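
The difference between the two kinds of assertion can be seen in a small sketch with made-up sample strings. It assumes CPython's standard re module, which (in the versions I know of) rejects lookbehinds whose alternatives have different widths.

```python
import re

# A lookahead may mix alternatives of different widths: \s is one character, $ is zero.
lookahead_result = re.sub(r'[;,](?=\s|$)', ' ', 'a; b,')

# The mirrored lookbehind (\s is one character, ^ is zero) fails to compile.
try:
    re.compile(r'(?<=\s|^)-')
    lookbehind_rejected = False
except re.error:
    lookbehind_rejected = True

# (?<!\S) covers both cases with a fixed width: start of string or preceded by whitespace.
safe_result = re.sub(r'(?<!\S)-', ' ', '-a -b c-d')

print(repr(lookahead_result))   # 'a  b '
print(lookbehind_rejected)      # True
print(repr(safe_result))        # ' a  b c-d'
```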

Upvotes: 3

The fourth bird

Reputation: 163362

You could use a single replacement by capturing in a group what you want to keep and remove what you don't want using an alternation.

That way you can extend the pattern with cases that you want to keep or want to remove. In the replacement you use the capturing group. Instead of using an inline modifier (?i) you could also use re.IGNORECASE in the code.

(c(?:\+{2}|#)|\.net\b)|[!,.:;/—]|-(?=[\d.])

That will match:

( Capture group
- c(?:\+{2}|#)|\.net\b Match c++ or c# or .net
) Close capture group
| Or
[!,.:;/—] Match any listed in the character class
| Or
-(?=[\d.]) Match a hyphen asserting what is directly on the right is a digit or a dot


For example

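import re

# Group 1 holds the tokens to keep (c++, c#, .net); the other alternatives are removed.
regex = r"(c(?:\+{2}|#)|\.net\b)|[!,.:;/—]|-(?=[\d.])"
text = 'Must-have skills: -.Net programming experience; -2 years experience in C++; C#/.Net, C++/.Net, C./.Net.'
# \1 is empty when group 1 did not take part in the match, so removed characters become one space.
text = re.sub(regex, r"\1 ", text, flags=re.IGNORECASE)

if text:
    print(text)

# Must-have skills   .Net  programming experience   2 years experience in C++   C#  .Net   C++  .Net   C  .Net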
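
A possible variation, sketched here under my own naming (KEEP_OR_DROP and keep_or_space are not part of the answer), passes a function as the replacement so that kept tokens are returned unchanged instead of gaining a trailing space. Whitespace is collapsed before printing only to make the comparison with the desired result easier.

```python
import re

# Same pattern as in the answer; the names below are hypothetical.
KEEP_OR_DROP = re.compile(r"(c(?:\+{2}|#)|\.net\b)|[!,.:;/—]|-(?=[\d.])", re.IGNORECASE)

def keep_or_space(match):
    # group(1) is None when one of the "remove" alternatives matched.
    return match.group(1) or ' '

text = 'Must-have skills: -.Net programming experience; -2 years experience in C++; C#/.Net, C++/.Net, C./.Net.'
result = KEEP_OR_DROP.sub(keep_or_space, text)

normalized = ' '.join(result.split())
print(normalized)
# Must-have skills .Net programming experience 2 years experience in C++ C# .Net C++ .Net C .Net
```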

Upvotes: 2

Chthonyx

Reputation: 707

The result differs from your desired output only in whitespace. I got it by reversing the order of the two re.sub calls and adding negative lookbehinds.

import re

text = 'Must-have skills: -.Net programming experience; -2 years experience in C++; C#/.Net, C++/.Net, C./.Net.'
# (?<!C)(?<!C\+) skips punctuation that directly follows "C" or "C+", which protects C++ and C#.
text = re.sub(r'(?i)(?:(?!\.net\b|\b-\b)(?<!C)(?<!C\+)[^\w\s])+(?=[^\w\s]*\b)', ' ', text)
text = re.sub(r'[!,.:;—](?= |$)', ' ', text)

Output:

print(text)
Must-have skills   .Net programming experience   2 years experience in C++  C# .Net  C++ .Net  C  .Net
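
To see the two negative lookbehinds in isolation, here is a minimal sketch with made-up sample strings and a stripped-down pattern (no .net or hyphen handling). It also shows a side effect: any punctuation right after C is protected, including a period, which as far as I can tell is why the second re.sub still has to run afterwards.

```python
import re

# Punctuation is replaced unless it directly follows "C" or "C+".
guarded = r'(?<!C)(?<!C\+)[^\w\s]'

tokens_kept = re.sub(guarded, ' ', 'C++ C# a+b')
period_kept = re.sub(guarded, ' ', 'C.,x')

print(tokens_kept)   # C++ C# a b
print(period_kept)   # C. x
```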

Upvotes: 3
