Alhpa Delta
Alhpa Delta

Reputation: 3660

How to split a column based on a string (when it exists) into a separate column

I have a dataframe as follows:

df = pd.DataFrame({
'Event':['2018 Green Meeting','2018 Yellow Meeting','2018 Red Meeting',
'2017 Green Meeting','2017 Yellow Meeting','2017 Red Meeting',
'2016 Green Meeting','2016 Yellow Meeting','2016 Red Meeting',
'Blue Meeting','Purple Meeting','Green Meeting'],
'Count':[1,2,3,4,5,6,7,8,9,10,11,12]
})

I want to split the Event Column into two columns 'Year' and 'Event_Name' to get the following output:

df2 = pd.DataFrame({
'Year':['2018','2018','2018',
'2017','2017','2017',
'2016','2016','2016',
'Blue Meeting','Purple Meeting','Green Meeting'],
'Event_Name':['Green Meeting','Yellow Meeting','Red Meeting',
'Green Meeting','Yellow Meeting','Red Meeting',
'Green Meeting','Yellow Meeting','Red Meeting',
'Blue Meeting','Purple Meeting','Green Meeting'],
'Count':[1,2,3,4,5,6,7,8,9,10,11,12]
})

When I try to use regex to do this. It dones not seem to be working. I am getting two columns 'Year' and 'Event_Name'. But the year is empty.

This is what I am using:

df[['Year','Event_Name']] = df['Event'].str.split(r'\d{4}',expand=True)

How do I get this to work correctly?

Upvotes: 1

Views: 77

Answers (4)

piRSquared
piRSquared

Reputation: 294258

pandas.Series.str.findall

s = df.Event.str.findall('(\d+|\D+)')

pd.DataFrame(dict(
    Count=df.Count,
    Event_Name=s.str[-1],
    Year=s.str[0]
))

    Count       Event_Name            Year
0       1    Green Meeting            2018
1       2   Yellow Meeting            2018
2       3      Red Meeting            2018
3       4    Green Meeting            2017
4       5   Yellow Meeting            2017
5       6      Red Meeting            2017
6       7    Green Meeting            2016
7       8   Yellow Meeting            2016
8       9      Red Meeting            2016
9      10     Blue Meeting    Blue Meeting
10     11   Purple Meeting  Purple Meeting
11     12    Green Meeting   Green Meeting

Non-regex

def f(x):
  a, b = x.split(None, 1)
  if a.isdecimal():
    return a, b
  else:
    return (x,)

s = df.Event.apply(f)

pd.DataFrame(dict(
    Count=df.Count,
    Event_Name=s.str[-1],
    Year=s.str[0]
))

    Count       Event_Name            Year
0       1    Green Meeting            2018
1       2   Yellow Meeting            2018
2       3      Red Meeting            2018
3       4    Green Meeting            2017
4       5   Yellow Meeting            2017
5       6      Red Meeting            2017
6       7    Green Meeting            2016
7       8   Yellow Meeting            2016
8       9      Red Meeting            2016
9      10     Blue Meeting    Blue Meeting
10     11   Purple Meeting  Purple Meeting
11     12    Green Meeting   Green Meeting

Upvotes: 4

BENY
BENY

Reputation: 323226

Using str.extract with fillna

df['Year']=df.Event.str.extract('(\d+)').fillna(df.Event)

then We do replace

df['even_name']=df.Event.str.replace('\d+', '')

Upvotes: 7

Atul Shanbhag
Atul Shanbhag

Reputation: 636

This should do the work

def get_year(x):
    try:
        return int(x.split()[0])
    except:
        return None

def get_event_name(x):
    try:
        year = int(x.split()[0])
        return ' '.join(x.split()[1: ])
    except:
        return x

df['Year'] = df['Event'].apply(lambda x: get_year(x))
df['Event_Name'] = df['Event'].apply(lambda x: get_event_name(x))
df = df.drop(['Event', ], axis=1)

Upvotes: 0

Scott Boston
Scott Boston

Reputation: 153460

Use extractall:

df[['Year','Event']] = df.Event.str.extractall('(\d{4})? ?(.+$)').reset_index('match', drop=True)

Output:

             Event  Count  Year
0    Green Meeting      1  2018
1   Yellow Meeting      2  2018
2      Red Meeting      3  2018
3    Green Meeting      4  2017
4   Yellow Meeting      5  2017
5      Red Meeting      6  2017
6    Green Meeting      7  2016
7   Yellow Meeting      8  2016
8      Red Meeting      9  2016
9     Blue Meeting     10   NaN
10  Purple Meeting     11   NaN
11   Green Meeting     12   NaN

Upvotes: 3

Related Questions