Reputation: 23
I want to extract data from a column where each cell is a string listing hotel room numbers and the packages occupying them at a given time. Each cell looks like the following
624: COUPLE , 507: DELUXE+ ,301: HONEYMOON
Here's the code snippet I have written to collect all the room numbers occupied and the packages purchased.
import numpy as np
import pandas as pd
d = np.array(['624: COUPLE , 507: DELUXE+ ,301: HONEYMOON', '614:FAMILY , 507: FAMILY+'])
df = pd.Series(d)
df = df.str.extractall(r'(?P<room>[0-9]+)(?P<package>[\S][^,]+)')
df
However, the output keeps the colon in front of the package name.
[Image: output of the given Python code]
How do I remove the colon in front of the package name in the output?
Upvotes: 2
Views: 26
Reputation: 627101
You can put a colon followed by an optional whitespace pattern (:\s*) between the two named capturing groups and use
>>> df.str.extractall(r'(?P<room>[0-9]+):\s*(?P<package>[^\s,]+)')
         room    package
  match
0 0       624     COUPLE
  1       507    DELUXE+
  2       301  HONEYMOON
1 0       614     FAMILY
  1       507    FAMILY+
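One thing to watch for: the question's code rebinds df to the result of extractall, which is a DataFrame, so the call above has to run on the original Series. A minimal sketch of the fix, runnable from scratch, might look like this. The names cells, pattern and result are my own, and casting room to int is an optional extra that is not part of the answer.

```python
import pandas as pd

# Sample cells copied from the question.
cells = pd.Series([
    '624: COUPLE , 507: DELUXE+ ,301: HONEYMOON',
    '614:FAMILY , 507: FAMILY+',
])

# The colon and the optional whitespace sit between the two groups, so they
# are consumed by the match but end up in neither column.
pattern = r'(?P<room>[0-9]+):\s*(?P<package>[^\s,]+)'
result = cells.str.extractall(pattern)

# extractall returns text columns; cast room only if integers are needed.
result['room'] = result['room'].astype(int)

print(result)
```

Keeping the Series and the result under separate names means the extraction can be re-run with a different pattern without rebuilding the input.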
See the regex demo. Details:
(?P<room>[0-9]+) - Group "room": one or more digits
:\s* - a colon and then zero or more whitespace characters
(?P<package>[^\s,]+) - Group "package": one or more characters other than whitespace and a comma

Upvotes: 1
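The breakdown above can be checked outside pandas with the standard re module, assuming pandas hands the pattern to Python's regex engine as it does for ordinary string columns. The sketch below compares the question's pattern with the answer's pattern on a single cell. The names cell, original and fixed are mine, and the verbose layout is only there to line the comments up with the three parts of the pattern.

```python
import re

cell = '624: COUPLE , 507: DELUXE+ ,301: HONEYMOON'

# Pattern from the question: [\S] matches the colon itself, and [^,]+ then
# runs up to the next comma, so the colon and the spaces land in "package".
original = re.compile(r'(?P<room>[0-9]+)(?P<package>[\S][^,]+)')

# Pattern from the answer, spelled out with re.VERBOSE so each part can
# carry a comment. Whitespace in the pattern text is ignored in this mode.
fixed = re.compile(r'''
    (?P<room>[0-9]+)        # group "room": one or more digits
    :\s*                    # a colon, then optional whitespace (not captured)
    (?P<package>[^\s,]+)    # group "package": no whitespace, no comma
''', re.VERBOSE)

print(original.findall(cell))
print(fixed.findall(cell))
```

With the question's pattern, every package value starts with the colon, for example ': COUPLE '. With the answer's pattern, the same cell gives 'COUPLE', 'DELUXE+' and 'HONEYMOON'.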