Reputation: 747
I have a fairly complex regular expression, but I have a problem with it: the # and ++ characters are removed when letters follow them, as in C#/.Net and C++/.Net.
Question: how can I add an exception to the current regex for the C++ and C# tokens?
I've used the following regex:
import re
text = 'Must-have skills: -.Net programming experience; -2 years experience in C++; C#/.Net, C++/.Net, C./.Net.'
text = re.sub(r'[!,.:;—](?= |$)', ' ', text)
print(re.sub(r'(?i)(?:(?!\.net\b|\b-\b)[^\w\s])+(?=[^\w\s]*\b)', ' ', text))
And I got the following result:
'Must-have skills .Net programming experience 2 years experience in C++ C .Net C .Net C .Net '
Desired result:
'Must-have skills .Net programming experience 2 years experience in C++ C# .Net C++ .Net C .Net '
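To see why C# and C++ lose their symbols, it can help to list what the second pattern actually matches. The following is a minimal sketch using the same text and patterns as above; the names `step1`, `second` and `removed` are only for illustration.

```python
import re

text = 'Must-have skills: -.Net programming experience; -2 years experience in C++; C#/.Net, C++/.Net, C./.Net.'

# First pass from the question: punctuation followed by a space or the end of the string.
step1 = re.sub(r'[!,.:;—](?= |$)', ' ', text)

# Second pass from the question; findall returns every span it would replace.
second = r'(?i)(?:(?!\.net\b|\b-\b)[^\w\s])+(?=[^\w\s]*\b)'
removed = re.findall(second, step1)
print(removed)
```

Runs such as `#/` and `++/` are matched as a whole, because the lookahead only needs a word boundary somewhere after the punctuation. The `++` in `C++;` is followed by whitespace after the first pass, so it survives.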
Current regex details
(?i) - case insensitive mode on
(?:(?!\.net\b|\b-\b)[^\w\s])+ - any punctuation char ([^\w\s]), 1 or more occurrences, as many as possible, that does not start any of the sequences:
    \.net\b - .net as whole word
    | - or
    \b-\b - a hyphen enclosed with word chars
(?=[^\w\s]*\b) - a positive lookahead that requires 0+ punctuation chars followed with a word boundary position immediately to the right of the current location.
Upvotes: 4
Views: 322
Reputation: 482
Edit
#1
Same as #2 below but much shorter: all the characters that may precede the matched one are defined in a single set.
>>> import re
>>> text = 'Must-have skills: -.Net programming experience; -2 years experience in C++; C#/.Net, C++/.Net, C./.Net.'
>>> re.sub(r'(?:(?<!\S)|(?<=[\s\+\.C#]))[\-!,.:;—/]|[\-!,.:;—/](?=\s|$)', ' ', text)
#Output
'Must-have skills .Net programming experience 2 years experience in C++ C# .Net C++ .Net C .Net '
#2
This is a really dirty solution, but it works. The explanation is below; the pattern could still be refined for better readability.
>>> import re
>>> text = 'Must-have skills: -.Net programming experience; -2 years experience in C++; C#/.Net, C++/.Net, C./.Net.'
>>> re.sub(r'(?:(?<!\S)|(?<=\s)|(?<=\+)|(?<=\.)|(?<=C)|(?<=#))[\-!,.:;—/]|[\-!,.:;—/](?=\s|$)', ' ', text)
#Output
'Must-have skills .Net programming experience 2 years experience in C++ C# .Net C++ .Net C .Net '
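As a quick check, the sketch below runs both patterns (#1 and #2) on the sample text and compares the results. The raw results contain runs of several spaces, so the comparison with the desired string is done after collapsing whitespace. The names `short`, `long_form` and `squeeze` are my own.

```python
import re

text = 'Must-have skills: -.Net programming experience; -2 years experience in C++; C#/.Net, C++/.Net, C./.Net.'

# Pattern #1: one lookbehind with a character set.
short = r'(?:(?<!\S)|(?<=[\s\+\.C#]))[\-!,.:;—/]|[\-!,.:;—/](?=\s|$)'
# Pattern #2: one lookbehind per character.
long_form = r'(?:(?<!\S)|(?<=\s)|(?<=\+)|(?<=\.)|(?<=C)|(?<=#))[\-!,.:;—/]|[\-!,.:;—/](?=\s|$)'

def squeeze(s):
    # Collapse runs of whitespace and trim both ends.
    return ' '.join(s.split())

a = re.sub(short, ' ', text)
b = re.sub(long_form, ' ', text)
print(a == b)
print(squeeze(a))
```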
Edit: Explanation
(?:
    This opens a non-capturing group. What I want to match has to be preceded by one of the conditions listed inside this group.
(?<! and (?<=
    When I wrote this version I did not put a set of characters inside a single lookbehind, so the group holds several (?<! and (?<= assertions. Each one says that the match should, or should NOT, be preceded by one particular character.
(?<!\S)
    For this input it is not strictly needed, but I included it as a safety net. It says that [\-!,.:;—/] should NOT be matched if it is preceded by a non-whitespace character, so it succeeds after whitespace or at the very start of the string.
|(?<=\s)
    OR [\-!,.:;—/] should be matched if it is preceded by a single whitespace character.
|(?<=\+)|(?<=\.)|(?<=C)|(?<=#)
    OR [\-!,.:;—/] should be matched if it is preceded by +, ., C or #. So the . in [\-!,.:;—/] is matched when it is preceded by C, as in C./.Net in your string (remember (?<=C)), and the ; is matched when it is preceded by +, as in C++; (remember (?<=\+)).
)
    The ) before the | closes the (?: group.
|[\-!,.:;—/](?=\s|$)
    | is OR, as you know. Because I can't fold everything into one statement, I have to repeat [\-!,.:;—/] and then add a lookahead that says: match [\-!,.:;—/] if it is followed by whitespace or the end of the string. A lookahead can contain any pattern, so you can use OR inside it. A lookbehind in Python's re module has to match a fixed width, so any alternatives inside it must all have the same length.
Upvotes: 3
Reputation: 163362
You could use a single replacement: capture in a group what you want to keep, and match what you want to remove using an alternation.
That way you can extend the pattern with cases that you want to keep or want to remove. In the replacement you use the capturing group. Instead of using an inline modifier (?i) you could also use re.IGNORECASE in the code.
(c(?:\+{2}|#)|\.net\b)|[!,.:;/—]|-(?=[\d.])
That will match:
( - Capture group
c(?:\+{2}|#)|\.net\b - Match c++ or c# or .net
) - Close capture group
| - Or
[!,.:;/—] - Match any character listed in the character class
| - Or
-(?=[\d.]) - Match a hyphen, asserting that what is directly on the right is a digit or a dot
For example
import re
regex = r"(c(?:\+{2}|#)|\.net\b)|[!,.:;/—]|-(?=[\d.])"
text = 'Must-have skills: -.Net programming experience; -2 years experience in C++; C#/.Net, C++/.Net, C./.Net.'
# Pass flags by keyword; positional count/flags are deprecated in recent Python versions.
text = re.sub(regex, r"\1 ", text, flags=re.IGNORECASE)
if text:
    print(text)
# Must-have skills .Net programming experience 2 years experience in C++ C# .Net C++ .Net C .Net
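To illustrate how the keep list can be extended, here is a minimal sketch that builds the same pattern from a tuple of alternatives. The function `clean`, the constant `DEFAULT_KEEP` and the extra `node.js` token are hypothetical names chosen for the example; they are not part of the original pattern. Whitespace is collapsed at the end, which the code above does not do.

```python
import re

text = 'Must-have skills: -.Net programming experience; -2 years experience in C++; C#/.Net, C++/.Net, C./.Net.'

# The two alternatives from the capture group above.
DEFAULT_KEEP = (r"c(?:\+{2}|#)", r"\.net\b")

def clean(text, keep=DEFAULT_KEEP):
    # Tokens matched by the capture group are kept; the other alternatives are dropped.
    pattern = "(" + "|".join(keep) + r")|[!,.:;/—]|-(?=[\d.])"
    # group(1) is None when a "remove" alternative matched.
    cleaned = re.sub(pattern, lambda m: (m.group(1) or "") + " ", text, flags=re.IGNORECASE)
    return " ".join(cleaned.split())  # collapse runs of spaces

print(clean(text))
print(clean("Node.js/C++."))
# Assumed extra token, only to show how the keep list grows.
print(clean("Node.js/C++.", keep=DEFAULT_KEEP + (r"\bnode\.js\b",)))
```

Without the extra alternative the dot in Node.js is removed like any other punctuation; with it, the whole token goes through the capture group and is kept.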
Upvotes: 2
Reputation: 707
It's not quite the same as your output, but the only difference is whitespace. I got there by reversing the order of the two re.sub calls and adding two negative lookbehinds, (?<!C) and (?<!C\+).
import re
text = 'Must-have skills: -.Net programming experience; -2 years experience in C++; C#/.Net, C++/.Net, C./.Net.'
text = re.sub(r'(?i)(?:(?!\.net\b|\b-\b)(?<!C)(?<!C\+)[^\w\s])+(?=[^\w\s]*\b)', ' ', text)
text = re.sub('[!,.:;—](?= |$)', ' ', text)
print(text)
Output:
Must-have skills .Net programming experience 2 years experience in C++ C# .Net C++ .Net C .Net
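The effect of the two lookbehinds can be seen in isolation with a much smaller pattern. This is only a sketch: the pattern `guarded` and the string `sample` are made up for the demonstration and are not the regex used above.

```python
import re

# Remove '+' and '#' unless they belong to C++ or C# (case-insensitive).
# (?<!C)   protects the character right after a C.
# (?<!C\+) protects the second '+' of C++.
guarded = r'(?i)(?<!C)(?<!C\+)[+#]'
sample = 'C++ C# A+ B# c++'
result = re.sub(guarded, '', sample)
print(result)
```

Note that in the full pattern (?<!C) protects any punctuation directly after a C, including the dot in C./.Net, so that dot is only removed by the second re.sub.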
Upvotes: 3