Reputation: 345
Goal
Extract the number that comes before the word hours, hour, day, or days.
How do I use | to match the words?
s = '2 Approximately 5.1 hours 100 ays 1 s'
re.findall(r"([\d.+-/]+)\s*[days|hours]", s) # note I do not know whether string s contains hours or days
returns
['5.1', '100', '1']
Since 100 and 1 do not come before the exact word hours, they should not show up. Expected:
5.1
s1 = '2 Approximately 10.2 +/- 30hours'
re.findall(r"([\d. +-/]+)\s*hours|\s*hours", s1)
returns
['10.2 +/- 30']
Expected:
10.2
Note that the special characters +/- and . are optional. When . appears, as in 1.3, the full 1.3 needs to be extracted, including the decimal point. But when 1 +/- 0.5 appears, only 1 should be extracted, and none of the +/- part.
I know I could probably do a split and then take the first number
str(re.findall(r"([\d. +-/]+)\s*hours", s1)[0]).split(" ")[1]
Gives
'10.2'
But some of the results only return one number so a split will cause an error. Should I do this with another step or could this be done in one step?
Please note that these strings s1, s2 are values in a dataframe, so the solution has to work when applied row by row with something like apply and a lambda.
Upvotes: 2
Views: 3073
Reputation: 17166
Code
import re
units = '|'.join(["hours", "hour", "hrs", "days", "day", "minutes", "minute", "min"]) # possible units
number = r'\d+[.,]?\d*' # pattern for number (raw string, so \d is not treated as a string escape)
plus_minus = r'\+\/\-' # plus minus (kept for reference; not used in the pattern below)
cases = fr'({number})(?:[\s\d\-\+\/]*)(?:{units})'
pattern = re.compile(cases)
Tests
print(pattern.findall('2 Approximately 5.1 hours 100 ays 1 s'))
# Output: ['5.1']
print(pattern.findall('2 Approximately 10.2 +/- 30hours'))
# Output: ['10.2']
print(pattern.findall('The mean half-life for Cetuximab is 114 hours (range 75-188 hours).'))
# Output: ['114', '75']
print(pattern.findall('102 +/- 30 hours in individuals with rheumatoid arthritis and 68 hours in healthy adults.'))
# Output: ['102', '68']
print(pattern.findall("102 +/- 30 hrs"))
# Output: ['102']
print(pattern.findall("102-130 hrs"))
# Output: ['102']
print(pattern.findall("102hrs"))
# Output: ['102']
print(pattern.findall("102 hours"))
# Output: ['102']
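Since the question mentions that the strings live in a dataframe, here is a minimal sketch of wrapping the pattern in a function that returns only the first value, or None when nothing matches. The name first_value and the sample rows are made up for illustration; the pattern is the same one as above.

```python
import re

# Same unit list and number pattern as in the answer above.
UNITS = "|".join(["hours", "hour", "hrs", "days", "day", "minutes", "minute", "min"])
NUMBER = r"\d+[.,]?\d*"
PATTERN = re.compile(fr"({NUMBER})(?:[\s\d\-\+\/]*)(?:{UNITS})")

def first_value(text):
    """Return the first number found before a unit, or None if there is none."""
    match = PATTERN.search(text)
    return match.group(1) if match else None

# Hypothetical sample rows standing in for the dataframe column.
rows = [
    "2 Approximately 5.1 hours 100 ays 1 s",
    "2 Approximately 10.2 +/- 30hours",
    "no duration here",
]
print([first_value(row) for row in rows])
# ['5.1', '10.2', None]
```

With pandas, a function like this could presumably be passed straight to Series.apply, for example df["text"].apply(first_value), where the column name "text" is an assumption. Returning None instead of indexing into a list avoids the error the question ran into when a row has no match.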
Explanation
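The code above relies on the fact that raw strings (r'...') and f-string interpolation (f'...') can be combined as fr'...', per PEP 498.
The cases string:
fr'({number})(?:[\s\d\-\+\/]*)(?:{units})'
Its parts, in sequence: a capturing group for the number, a non-capturing group that skips any whitespace, digits, and +, -, / characters, and a non-capturing group that requires one of the units.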
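To make the individual parts easier to see, the same pattern can be written with re.VERBOSE, which ignores whitespace in the pattern and allows a comment per line. This is only a sketch of an equivalent layout; the name verbose_pattern is made up here.

```python
import re

units = "|".join(["hours", "hour", "hrs", "days", "day", "minutes", "minute", "min"])

verbose_pattern = re.compile(
    fr"""
    (\d+[.,]?\d*)        # 1. captured: the number, such as 5.1 or 102
    (?:[\s\d\-\+\/]*)    # 2. skipped: any run of whitespace, digits, plus, minus, slash
    (?:{units})          # 3. required but not captured: one of the unit words
    """,
    re.VERBOSE,
)

print(verbose_pattern.findall("102 +/- 30 hrs"))
# ['102']
```

Only the first group is capturing, which is why findall returns plain strings rather than tuples.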
Upvotes: 2
Reputation: 522094
I would use re.findall here:
import re
units = ["hours", "hour", "days", "day"] # the order matters here: put plurals first
regex = r'(?:' + '|'.join(units) + r')'
s = '2 Approximately 5.1 hours 100 ays 1 s'
values = re.findall(r'\b(\d+(?:\.\d+)?)\s+' + regex, s)
print(values) # prints ['5.1']
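For comparison, here is a short sketch of why the bracketed version in the question also returned 100 and 1. Square brackets build a character class, so [days|hours] matches any single one of the characters d, a, y, s, |, h, o, u, r rather than either word. The variable names are made up, and the number part is simplified for the demonstration.

```python
import re

s = "2 Approximately 5.1 hours 100 ays 1 s"

# Character class: one character from the set is enough,
# so the "a" in "ays" and the lone "s" both count as a match.
char_class = re.findall(r"(\d+(?:\.\d+)?)\s*[days|hours]", s)
print(char_class)
# ['5.1', '100', '1']

# Alternation inside a group: the whole word has to be present.
alternation = re.findall(r"(\d+(?:\.\d+)?)\s*(?:days|hours)", s)
print(alternation)
# ['5.1']
```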
If you want to also capture the units being used, then make the units alternation capturing, i.e. use:
regex = r'(' + '|'.join(units) + r')'
Then the output would be:
[('5.1', 'hours')]
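The comment about putting plurals first can be checked with a small experiment. Alternation tries its branches left to right and stops at the first one that works, so listing hour before hours captures the shorter word. The sketch below uses a made-up sample string; the last variant, which adds a trailing \b, is an alternative not used in the answer above.

```python
import re

text = "5.1 hours"

# Singular first: "hour" matches, so "hours" is never tried.
singular_first = re.findall(r"(\d+(?:\.\d+)?)\s+(hour|hours)", text)
print(singular_first)
# [('5.1', 'hour')]

# Plural first: the longer word is tried first and wins.
plural_first = re.findall(r"(\d+(?:\.\d+)?)\s+(hours|hour)", text)
print(plural_first)
# [('5.1', 'hours')]

# A trailing word boundary forces the engine to back up and take the full word.
with_boundary = re.findall(r"(\d+(?:\.\d+)?)\s+(hour|hours)\b", text)
print(with_boundary)
# [('5.1', 'hours')]
```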
Upvotes: 3