Reputation: 361
Given a dataframe A
that looks like this:
id | information |
---|---|
001 | Yellow, in town, John |
002 | Green, home, Lia 33 |
003 | Yellow, garden, Peter2543 |
004 | Red, 23 garden, 004 John891 |
005 | Red, home, 245Sarah |
006 | Red 2, park 28, 67 Luke |
007 | Purple 03, to the beach, Mary Rose 9855 |
... | ... |
I want to create a new column called name
by extracting the name from information
, without numbers. That is:
id | information | name |
---|---|---|
001 | Yellow, in town, John | John |
002 | Green, wardrobe, home, Lia 33 | Lia |
003 | Yellow, garden, Peter2543 | Peter |
004 | Red, 23 garden, 004 John891 | John |
005 | Red, hat, home, 245Sarah | Sarah |
006 | Red 2, park 28, 67 Luke | Luke |
007 | Purple 03, to the beach, Mary Rose 9855 | Mary Rose |
... | ... | ... |
Notice that:
information
is always after the last commaIf I do:
A['name'] = A['information'].apply(lambda x: x.rsplit(',', 1)[1] if ',' in x else x)
it returns everything after the last comma (i.e: John, Lia 33, Peter 2543,...). But I need to only get the name.
I guess I have to use re.split()
instead but I cannot figure out which should be the regex expression...
Upvotes: 1
Views: 676
Reputation: 626794
You can use
import pandas as pd
df = pd.DataFrame({"information":["Yellow, in town, John","Green, home, Lia 33","Yellow, garden, Peter2543","Red, 23 garden, 004 John891","Red, home, 245Sarah","Red 2, park 28, 67 Luke","Purple 03, to the beach, Mary Rose 9855"]})
df['name'] = df['information'].str.extract(r'.*,\s*(?:\d+\s*)?([^\d,]+?)(?:\s*\d+)?$', expand=False)
Output:
>>> df['information'].str.extract(r'.*,\s*(?:\d+\s*)?([^\d,]+?)(?:\s*\d+)?$', expand=False)
0 John
1 Lia
2 Peter
3 John
4 Sarah
5 Luke
6 Mary Rose
Name: information, dtype: object
Details:
.*,
- any zero or more chars other than line break chars as many as possible, and then a ,
char\s*
- zero or more whitespaces(?:\d+\s*)?
- an optional sequence of one or more digits and then zero or more whitespaces([^\d,]+?)
- Group 1: one or more chars other than digits and comma, as few as possible(?:\s*\d+)?
- an optional sequence of zero or more whitespaces and then one or more digits$
- end of string.Upvotes: 3