BN production

Reputation: 23

pandas Series string extraction using a regular expression: how to exclude certain symbols from the beginning?

I want to extract data from a column where each cell is a string listing a hotel's room numbers and the packages occupying them at a given time. Each cell looks like the following:

                          624: COUPLE , 507: DELUXE+ ,301: HONEYMOON 

Here's the code snippet I have written to collect all the room numbers occupied and the packages purchased.

import numpy as np
import pandas as pd
d = np.array(['624: COUPLE , 507: DELUXE+ ,301: HONEYMOON','614:FAMILY , 507: FAMILY+'])
df = pd.Series(d)
df= df.str.extractall(r'(?P<room>[0-9]+)(?P<package>[\S][^,]+)')
df
          

However, the output keeps the colon in front of each package name.

How do I remove the colon in front of the package name in the output?

Upvotes: 2

Views: 26

Answers (1)

Wiktor Stribiżew

Reputation: 627101

You can put a colon followed by an optional-whitespace pattern (:\s*) between the two named capturing groups, so that it is matched but not captured:

>>> df.str.extractall(r'(?P<room>[0-9]+):\s*(?P<package>[^\s,]+)')
        room    package
  match                
0 0      624     COUPLE
  1      507    DELUXE+
  2      301  HONEYMOON
1 0      614     FAMILY
  1      507    FAMILY+

Pattern details:

  • (?P<room>[0-9]+) - Group "room": one or more digits
  • :\s* - a colon and then zero or more whitespace characters
  • (?P<package>[^\s,]+) - Group "package": one or more chars other than whitespace or a comma.
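
To see why the pattern in the question keeps the colon, both patterns can be compared on one cell. The following is a minimal sketch that uses the standard-library re module instead of pandas, so the regex can be checked in isolation (str.extractall accepts the same regex syntax). The variable names are illustrative only and do not come from the question.

```python
import re

# One sample cell, copied from the question.
cell = '624: COUPLE , 507: DELUXE+ ,301: HONEYMOON'

# Pattern from the question: [\S] accepts the colon as the first character
# of "package", and [^,]+ then runs up to the next comma, spaces included.
original = r'(?P<room>[0-9]+)(?P<package>[\S][^,]+)'

# Pattern from the answer: the colon and any whitespace are matched between
# the two groups, so they never end up inside "package".
fixed = r'(?P<room>[0-9]+):\s*(?P<package>[^\s,]+)'

original_matches = re.findall(original, cell)
fixed_matches = re.findall(fixed, cell)

print(original_matches)
# [('624', ': COUPLE '), ('507', ': DELUXE+ '), ('301', ': HONEYMOON')]
print(fixed_matches)
# [('624', 'COUPLE'), ('507', 'DELUXE+'), ('301', 'HONEYMOON')]
```

Note that [^\s,]+ stops at the first whitespace character, so a package name containing a space would be cut off at its first word. None of the sample data has such names.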

Upvotes: 1
