Samira Kumar

Reputation: 521

Python regex to get numbers between optional strings

I have the following strings:

"Start 2 h 30 m End 3 h 20 m"
"Start 30 m End 10 m"

How can I extract just the numbers so that the output is

|Start_h|Start_m|End_h|End_m|
|-------|-------|-----|-----|
|      2|     30|    3|   20|
|    NaN|     30|  NaN|   10|

My attempt used str.extract in pandas:

df['time'].str.extract(r'Start (\w+) h (\w+) m ')

But this doesn't give me the "m" value alone when "h" is not present.

Upvotes: 0

Views: 76

Answers (2)

Radosław Cybulski

Reputation: 2992

Try this one:

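import re

# Skip any non-digits, then capture one run of digits.
# "+" requires at least one digit, so no empty match is printed at the end.
r = re.compile(r'[^0-9]*([0-9]+)')
t = 'Start 2 h 30 m End 3 h 20 m'
i = 0
while i < len(t):
    m = r.search(t, i)  # resume where the previous match ended
    if not m:
        break  # no digits left in the rest of the string
    print(m.group(1))
    i = m.end(0)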

The loop skips each run of non-digit characters and then yields the following run of digits, one at a time. You can't easily do that with a single "full" regex, because a pattern needs one capture group per number, so you have to know up front how many numbers to expect.
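For comparison, here is a minimal sketch of the same idea using re.findall, which returns every run of digits without an explicit loop. The helper name extract_numbers is made up for illustration. Note the limitation for this question: when the hours are missing, the result is just shorter, and nothing says which column each number belongs to.

```python
import re

def extract_numbers(text):
    # Return every run of digits in the text, in order of appearance.
    return re.findall(r"\d+", text)

print(extract_numbers("Start 2 h 30 m End 3 h 20 m"))
print(extract_numbers("Start 30 m End 10 m"))
```

The second call returns only two values, so this approach alone cannot fill the Start_h and End_h columns with NaN.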

EDIT: Use @accdias's version, as it's better.

Upvotes: 0

DYZ

Reputation: 57085

Here's a more robust pattern:

TIME = r"(?:(\d+) h )?(\d+) m" # Optional hr, required min
PATTERN = "Start {} End {}".format(TIME, TIME)
df['time'].str.extract(PATTERN)
#     0   1    2   3
#0    2  30    3  20
#1  NaN  30  NaN  10

Note that you need a separate capture group for each column, four groups in total.
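If you want the columns named as in the question, one option is to use named groups. The sketch below is an assumption layered on top of the pattern above, not part of the original answer: the group names Start_h, Start_m, End_h and End_m are taken from the question's table, and parse_times is a hypothetical helper that uses plain re so the pattern can be checked without a DataFrame.

```python
import re

# Same structure as above, but each group is named; {0} is filled with
# the prefix ("Start" or "End") so the four group names are unique.
TIME = r"(?:(?P<{0}_h>\d+) h )?(?P<{0}_m>\d+) m"
PATTERN = "Start {} End {}".format(TIME.format("Start"), TIME.format("End"))

def parse_times(text):
    # Return a dict of group name -> matched digits (None if the
    # optional hour part is absent), or None if the text doesn't match.
    m = re.search(PATTERN, text)
    return m.groupdict() if m else None

print(parse_times("Start 2 h 30 m End 3 h 20 m"))
print(parse_times("Start 30 m End 10 m"))
```

pandas uses capture group names as column names in str.extract, so df['time'].str.extract(PATTERN) with this pattern should give columns labelled Start_h, Start_m, End_h and End_m instead of 0 to 3.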

Upvotes: 3
