Reputation: 33
I am new to regex and have a regex replacement in a re.sub
that I can't figure out.
import re
test_cases = [
"1-Some String #0123",
"2-Some String #1234-56-a",
"3-Some String #1234-56A ",
"4-Some String (Fubar/ #12-345-67A)",
"5-Some String (Fubar - #12-345.67 A)",
"6-Some String / #123",
"7-Some String/#0233",
"8-Some #1 String/#0233"
]
for test in test_cases:
    test = re.sub(r'[/|#][A-Z|a-z|0-9|-]*','', test)
    print(test)
The code should print:
1-Some String
2-Some String
3-Some String
4-Some String (Fubar)
5-Some String (Fubar)
6-Some String
7-Some String
8-Some #1 String
But, instead I am currently getting this (with 4,5,8 not fully converted):
1-Some String
2-Some String
3-Some String
4-Some String (Fubar )
5-Some String (Fubar - .67 A)
6-Some String
7-Some String
8-Some String
Upvotes: 2
Views: 440
Reputation: 163632
Another option is to match only the last occurrence of # using a negative lookahead (?![^#\n\r]*#). For clarity, I have written the space inside square brackets: [ ].
[ ]*(?:[/-][ ]*)?#(?![^#\n\r]*#)[\da-zA-Z. -]+
Explanation
[ ]*  Match a space 0+ times
(?:[/-][ ]*)?  Optionally match / or - and 0+ spaces
#  Match literally
(?![^#\n\r]*#)  Negative lookahead, assert that what is on the right does not contain #
[\da-zA-Z. -]+  Match 1+ times what is listed in the character class
In the replacement use an empty string.
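A minimal sketch of how this pattern could be applied with re.sub to the test strings from the question. The names pattern, strip_tail, and results are illustrative and not part of the original answer.

```python
import re

# Pattern from the answer above: remove the last '#...' chunk together with
# an optional '/' or '-' separator and the spaces around it.
pattern = re.compile(r"[ ]*(?:[/-][ ]*)?#(?![^#\n\r]*#)[\da-zA-Z. -]+")

test_cases = [
    "1-Some String #0123",
    "2-Some String #1234-56-a",
    "3-Some String #1234-56A ",
    "4-Some String (Fubar/ #12-345-67A)",
    "5-Some String (Fubar - #12-345.67 A)",
    "6-Some String / #123",
    "7-Some String/#0233",
    "8-Some #1 String/#0233",
]


def strip_tail(text):
    """Replace the matched chunk with an empty string (hypothetical helper)."""
    return pattern.sub("", text)


results = [strip_tail(t) for t in test_cases]
for r in results:
    print(r)
```

In case 8 the first # is skipped because the lookahead sees another # further to the right, so only the trailing /#0233 is removed.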
Upvotes: 2
Reputation: 4418
It is probably easier to do it in two steps:
First: Clean up the part in parentheses. After the '(' and some letters, remove everything up to the closing ')'.
Second: Remove the unwanted stuff at the end of a line. The unwanted part starts either at a '#' followed by 2 or more digits, or at a '/'. There may be whitespace before the '#' or '/'.
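import re
# test_cases is the list defined in the question
paren_re = re.compile(r"([(][a-zA-Z]+)([^)]*)")
eol_re = re.compile(r"(.*?)\s*(?:#\d\d|/).*")
for line in test_cases:
    # Step 1: keep only "(letters" and drop the rest up to ")"
    result = paren_re.sub(r"\1", line)
    # Step 2: drop everything from "#dd" or "/" to the end of the line
    result = eol_re.sub(r"\1", result)
    print(result)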
Upvotes: 1
Reputation: 22087
Please try the following:
import re
test_cases = [
"1-Some String #0123",
"2-Some String #1234-56-a",
"3-Some String #1234-56A ",
"4-Some String (Fubar/ #12-345-67A)",
"5-Some String (Fubar - #12-345.67 A)",
"6-Some String / #123",
"7-Some String/#0233",
"8-Some #1 String/#0233"
]
for test in test_cases:
    test = re.sub(r'\s*([/#]|- )[\sA-Za-z0-9-#\.]*(?=(\)|$))','', test)
    print(test)
Result:
1-Some String
2-Some String
3-Some String
4-Some String (Fubar)
5-Some String (Fubar)
6-Some String
7-Some String
8-Some #1 String
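The regex (substring to delete) can be defined as:
optional leading whitespace: \s*
followed by /, #, or "- ": ([/#]|- )
followed by any run of whitespace, letters, digits, -, #, or .: [\sA-Za-z0-9-#\.]*
ending right before ) or the end of the string: (?=(\)|$))
Then the regex will look like: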
\s*([/#]|- )[\sA-Za-z0-9-#\.]*(?=(\)|$))
The positive lookahead may require some explanation. The pattern (?=regex) is a zero-width assertion meaning "followed by regex". The benefit is that the matched substring does not include the regex, and you can use it as an anchor.
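To illustrate the point about lookahead, here is a small sketch comparing a pattern with and without (?=...). The sample string and variable names are made up for this illustration and do not come from the thread.

```python
import re

# Hypothetical sample string, not taken from the question.
sample = "(Fubar #12)"

# With a lookahead, ' #12' is deleted only when ')' follows,
# and the ')' itself is not part of the match.
with_lookahead = re.sub(r"\s*#\d+(?=\))", "", sample)

# Without the lookahead, ')' must be matched directly,
# so it is deleted along with the rest.
without_lookahead = re.sub(r"\s*#\d+\)", "", sample)

print(with_lookahead)     # (Fubar)
print(without_lookahead)  # (Fubar
```

This is why the lookahead is useful here: the closing parenthesis has to stay in the output, so it is only checked and never consumed.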
Upvotes: 3
Reputation: 2882
I couldn't fit them into one regex, maybe someone can. Here's a 2-line solution:
import re
test_cases = [
"1-Some String #0123",
"2-Some String #1234-56-a",
"3-Some String #1234-56A ",
"4-Some String (Fubar/ #12-345-67A)",
"5-Some String (Fubar - #12-345.67 A)",
"6-Some String / #123",
"7-Some String/#0233",
"8-Some #1 String/#0233"
]
for test in test_cases:
    test = re.sub(r'[\/#][\w\s\d\-]*', '', test)
    test = re.sub(r'[\s\.\-\d]+\w+\)', ')', test)
    print(test)
Output:
1-Some String
2-Some String
3-Some String
4-Some String (Fubar)
5-Some String (Fubar)
6-Some String
7-Some String
8-Some
Explanation:
\w for word characters (a-zA-Z, 0-9 and _)
\d for 0-9
\s for whitespace
\. for a dot
\- for a minus sign
But I'm confused by your last line of output: why does it keep "#1 String", and based on what rule? If you can confirm the rule, you can write a specific regex for that pattern.
Upvotes: 0