Reputation: 521
I have the following strings:
"Start 2 h 30 m End 3 h 20 m"
"Start 30 m End 10 m"
How can I extract just the numbers, so that the output is:
|Start_h|Start_m|End_h|End_m|
|-------|-------|-----|-----|
| 2| 30| 3| 20|
| NaN| 30| NaN| 10|
My attempt used str.extract in pandas:
df['time'].str.extract(r'Start (\w+) h (\w+) m ')
but this fails to capture the minutes ("m") when the hours part ("h") is not present.
Upvotes: 0
Views: 76
Reputation: 2992
Try this one:
import re

# Skip any non-digits, then capture one or more digits
r = re.compile(r'[^0-9]*([0-9]+)')
t = 'Start 2 h 30 m End 3 h 20 m'
i = 0
while i < len(t):
    m = r.search(t, i)  # resume scanning at position i
    if not m:
        break
    print(m.group(1))
    i = m.end(0)
The loop skips each non-digit fragment and then prints the digit fragment that follows, one at a time. This is hard to do with a single "full" regex, because the pattern needs one capturing group per number, so you have to know up front how many numbers to expect.
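If all you need is the list of digit runs, the same result can be sketched without the manual loop by using re.findall from the standard library. This is only a minimal illustration on the sample strings from the question, not a replacement for the column-wise extraction:

```python
import re

t = 'Start 2 h 30 m End 3 h 20 m'

# findall returns every non-overlapping run of digits, left to right
numbers = re.findall(r'\d+', t)
print(numbers)

# Caveat: with the hours missing, only two numbers come back, and nothing
# in the list says whether they were hours or minutes
short = re.findall(r'\d+', 'Start 30 m End 10 m')
print(short)
```

Because the positions are lost, this approach cannot fill the missing hour columns with NaN on its own.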
EDIT: Use @accdias's version, as it's better.
Upvotes: 0
Reputation: 57085
Here's a more robust pattern:
TIME = r"(?:(\d+) h )?(\d+) m" # Optional hr, required min
PATTERN = "Start {} End {}".format(TIME, TIME)
df['time'].str.extract(PATTERN)
#      0   1    2   3
# 0    2  30    3  20
# 1  NaN  30  NaN  10
Note that you need a separate capturing group for each column, four groups in total.
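To get the column names asked for in the question instead of 0 to 3, one option is to use named groups, which str.extract turns into column labels. The sketch below assumes the two sample strings from the question; the helper time_pattern is a hypothetical name introduced only for this illustration:

```python
import pandas as pd

df = pd.DataFrame({'time': ['Start 2 h 30 m End 3 h 20 m',
                            'Start 30 m End 10 m']})

def time_pattern(prefix):
    # Hypothetical helper: optional hours, required minutes, with each
    # group named after the column it should fill
    return r"(?:(?P<{0}_h>\d+) h )?(?P<{0}_m>\d+) m".format(prefix)

PATTERN = "Start {} End {}".format(time_pattern("Start"), time_pattern("End"))

# Named groups become the column names of the resulting DataFrame
result = df['time'].str.extract(PATTERN)
print(result)
```

The extracted values are strings; if numbers are needed, each column can be converted afterwards, for example with pd.to_numeric.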
Upvotes: 3