LRD

Reputation: 361

Regex to extract the string after the last comma, without numbers

Given a dataframe A that looks like this:

id information
001 Yellow, in town, John
002 Green, home, Lia 33
003 Yellow, garden, Peter2543
004 Red, 23 garden, 004 John891
005 Red, home, 245Sarah
006 Red 2, park 28, 67 Luke
007 Purple 03, to the beach, Mary Rose 9855
... ...

I want to create a new column called name by extracting the name from information, without numbers. That is:

id information name
001 Yellow, in town, John John
002 Green, home, Lia 33 Lia
003 Yellow, garden, Peter2543 Peter
004 Red, 23 garden, 004 John891 John
005 Red, home, 245Sarah Sarah
006 Red 2, park 28, 67 Luke Luke
007 Purple 03, to the beach, Mary Rose 9855 Mary Rose
... ... ...

Notice that:

If I do:

A['name'] = A['information'].apply(lambda x: x.rsplit(',', 1)[1] if ',' in x else x)

it returns everything after the last comma (e.g. John, Lia 33, Peter2543, ...). But I only need the name, without the numbers.

I guess I have to use re.split() instead, but I cannot figure out what the regex should be.

Upvotes: 1

Views: 676

Answers (1)

Wiktor Stribiżew

Reputation: 626794

You can use Series.str.extract with a single capturing group:

import pandas as pd
df = pd.DataFrame({"information":["Yellow, in town, John","Green, home, Lia 33","Yellow, garden, Peter2543","Red, 23 garden, 004 John891","Red, home, 245Sarah","Red 2, park 28, 67 Luke","Purple 03, to the beach, Mary Rose 9855"]})
df['name'] = df['information'].str.extract(r'.*,\s*(?:\d+\s*)?([^\d,]+?)(?:\s*\d+)?$', expand=False)

Output:

>>> df['information'].str.extract(r'.*,\s*(?:\d+\s*)?([^\d,]+?)(?:\s*\d+)?$', expand=False)
0         John
1          Lia
2        Peter
3         John
4        Sarah
5         Luke
6    Mary Rose
Name: information, dtype: object

Details:

  • .*, - zero or more chars other than line break chars, as many as possible, and then a , char (being greedy, this runs up to the last comma that still allows a match)
  • \s* - zero or more whitespaces
  • (?:\d+\s*)? - an optional sequence of one or more digits and then zero or more whitespaces
  • ([^\d,]+?) - Group 1: one or more chars other than digits and comma, as few as possible
  • (?:\s*\d+)? - an optional sequence of zero or more whitespaces and then one or more digits
  • $ - end of string.
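
To check the breakdown above on plain strings, here is a minimal sketch using only Python's built-in re module. It is the same pattern, written with re.VERBOSE so each part carries its own comment. The names NAME_RE and extract_name are made up for this illustration and are not part of pandas or of the answer's code.

```python
import re

# Same pattern as in the answer, spread out with re.VERBOSE.
# NAME_RE and extract_name are hypothetical names used only for this sketch.
NAME_RE = re.compile(
    r"""
    .*,            # greedy: everything up to and including the last comma
    \s*            # optional whitespace after that comma
    (?:\d+\s*)?    # optional leading digits, e.g. '004 ' or '245'
    ([^\d,]+?)     # group 1: the name (no digits, no commas), lazy
    (?:\s*\d+)?    # optional trailing digits, e.g. ' 33' or '2543'
    $              # end of string
    """,
    re.VERBOSE,
)


def extract_name(text):
    """Return the captured name, or None when the pattern does not match."""
    match = NAME_RE.search(text)
    return match.group(1) if match else None


samples = [
    "Yellow, in town, John",
    "Green, home, Lia 33",
    "Yellow, garden, Peter2543",
    "Red, 23 garden, 004 John891",
    "Red, home, 245Sarah",
    "Red 2, park 28, 67 Luke",
    "Purple 03, to the beach, Mary Rose 9855",
    "John",  # no comma at all: the pattern requires one, so no match
]

for s in samples:
    print(repr(s), "->", repr(extract_name(s)))
```

Note that a string with no comma does not match at all, so extract_name returns None for it. With Series.str.extract the equivalent row comes back as NaN, which differs from the lambda in the question (it keeps the whole string when there is no comma).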

Upvotes: 3
