Reputation: 953
I have a problem with replacements in a string: I want to replace every occurrence of 2h / 2 h / 2 heure / 2heure / 2 heures / 2heures with #hour. I tried:
import re
text = "I should leave the house at 16h45 but I am late and I should not be arriving between 2 h or 3h or maybe 4heures"
hour = re.compile(r'[0-9]+\s?(h|heures?)([0-9]+)?')
replaces = hour.sub('#hour', text)
print(replaces)
Output:
I should leave the house at #hour but I am late and I should not be arriving between #hour or #hour or maybe #houreures
Expected output:
I should leave the house at #hour but I am late and I should not be arriving between #hour or #hour or maybe #hour
How can I fix the #houreures problem?
Upvotes: 2
Views: 57
Reputation: 163277
You need to swap the order of the alternatives, because the h in the first alternative gets matched first.
In 4heures, for example, your regex first matches one or more digits with [0-9]+. Then, in the alternation (h|heures?), it can match the h of heures. The matched 4h is then replaced with #hour, resulting in #houreures.
import re
text = "I should leave the house at 16h45 but I am late and I should not be arriving between 2 h or 3h or maybe 4heures"
hour = re.compile(r'[0-9]+\s?(heures?|h)([0-9]+)?')
replaces = hour.sub('#hour', text)
print(replaces)
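To see why the order matters, here is a minimal sketch (the sample string and variable names are only illustrative) that compares what each alternation consumes from the same input:

```python
import re

sample = "4heures"

# Alternatives are tried left to right; the first one that lets the
# overall match succeed wins, even if a later one would match more text.
short_first = re.match(r'[0-9]+\s?(h|heures?)', sample)
long_first = re.match(r'[0-9]+\s?(heures?|h)', sample)

print(short_first.group())  # 4h
print(long_first.group())   # 4heures
```

With h listed first, the match stops after 4h and the rest of the word is left in the text, which is where the stray eures comes from.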
Upvotes: 1
Reputation: 195418
import re
text = "I should leave the house at 16h45 but I am late and I should not be arriving between 2 h or 3h or maybe 4heures"
s = re.sub(r'\d+\s*[h]?(eure)*[s]?\d*', '#hour', text)
print(s)
Output:
I should leave the house at #hour but I am late and I should not be arriving between #hour or #hour or maybe #hour
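One caveat worth checking before using this pattern: everything after \d+ is optional, so it also seems to replace bare numbers, along with any whitespace that follows them. A quick check with a made-up sample sentence (not from the question):

```python
import re

pattern = r'\d+\s*[h]?(eure)*[s]?\d*'

# Hypothetical sample: "42" is not a time, "3h" is.
sample = "Room 42 opens at 3h"
result = re.sub(pattern, '#hour', sample)

# "42 " (digits plus the trailing space) is matched as well,
# because h, eure, s and the trailing digits are all optional.
print(result)  # Room #houropens at #hour
```

If the text can contain other numbers, a pattern that requires the h is probably the safer choice.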
Upvotes: 1
Reputation: 626748
The h alternative matched the h in heures, and the heures? alternative was not even tested. Swapping the alternatives can solve the problem, but it is a better idea to use an optional non-capturing group instead (see solution below).
There is no need for the capturing parentheses in the pattern; I suggest removing them (or, if you want to use alternation, converting them to a non-capturing group).
Besides, the ([0-9]+)? pattern can be simplified to [0-9]*.
You may use
[0-9]+\s?h(?:eures?)?[0-9]*
See the regex demo
Details
[0-9]+ - one or more digits
\s? - 1 or 0 whitespace characters
h - the letter h
(?:eures?)? - an optional non-capturing group that matches 1 or 0 occurrences of eure or eures
[0-9]* - 0 or more digits.
See the Python demo:
import re
text = "I should leave the house at 16h45 but I am late and I should not be arriving between 2 h or 3h or maybe 4heures"
hour = re.compile(r'[0-9]+\s?h(?:eures?)?[0-9]*')
replaces = hour.sub('#hour', text)
print(replaces)
# => I should leave the house at #hour but I am late and I should not be arriving between #hour or #hour or maybe #hour
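One practical reason to prefer the non-capturing group shows up with re.findall, which returns the captured groups instead of the whole match when the pattern contains capturing groups. A small sketch, using a shortened version of the sample text (the variable names are illustrative):

```python
import re

text = "between 2 h or 3h or maybe 4heures at 16h45"

# With capturing groups, findall returns one tuple of groups per match.
capturing = re.findall(r'[0-9]+\s?(heures?|h)([0-9]+)?', text)

# Without capturing groups, findall returns the full matched strings.
non_capturing = re.findall(r'[0-9]+\s?h(?:eures?)?[0-9]*', text)

print(capturing)      # [('h', ''), ('h', ''), ('heures', ''), ('h', '45')]
print(non_capturing)  # ['2 h', '3h', '4heures', '16h45']
```

For a plain sub call both patterns give the same result; the difference only matters once the matches themselves are inspected.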
Upvotes: 2
Reputation: 861
Change the order of heures and h inside the parentheses, like this:
[0-9]+\s?(heures?|h)([0-9]+)?
should work.
With (h|heures?), you are telling the engine to try h first and to try heures only if h is not found. The catch is that whenever heures is present, h is always present too (it's the first character of heures), so the second alternative is never tried here. You therefore need to change the order: search for heures first and, if that is not present, search for h. Replacing (h|heures?) with (heures?|h) solves the problem.
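As a quick sanity check, the reordered pattern can be run against the variants listed in the question. This is only a sketch: the variants list is typed in by hand, and "2 h" is assumed to be the second variant the question meant.

```python
import re

hour = re.compile(r'[0-9]+\s?(heures?|h)([0-9]+)?')

# The six spellings from the question ("2 h" is an assumption).
variants = ["2h", "2 h", "2 heure", "2heure", "2 heures", "2heures"]
results = [hour.sub('#hour', v) for v in variants]

print(results)  # every entry is '#hour'
```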
Upvotes: 2