Reputation: 3660
I have a dataframe as follows:
df = pd.DataFrame({
'Event':['2018 Green Meeting','2018 Yellow Meeting','2018 Red Meeting',
'2017 Green Meeting','2017 Yellow Meeting','2017 Red Meeting',
'2016 Green Meeting','2016 Yellow Meeting','2016 Red Meeting',
'Blue Meeting','Purple Meeting','Green Meeting'],
'Count':[1,2,3,4,5,6,7,8,9,10,11,12]
})
I want to split the Event Column into two columns 'Year' and 'Event_Name' to get the following output:
df2 = pd.DataFrame({
'Year':['2018','2018','2018',
'2017','2017','2017',
'2016','2016','2016',
'Blue Meeting','Purple Meeting','Green Meeting'],
'Event_Name':['Green Meeting','Yellow Meeting','Red Meeting',
'Green Meeting','Yellow Meeting','Red Meeting',
'Green Meeting','Yellow Meeting','Red Meeting',
'Blue Meeting','Purple Meeting','Green Meeting'],
'Count':[1,2,3,4,5,6,7,8,9,10,11,12]
})
When I try to use regex to do this. It dones not seem to be working. I am getting two columns 'Year' and 'Event_Name'. But the year is empty.
This is what I am using:
df[['Year','Event_Name']] = df['Event'].str.split(r'\d{4}',expand=True)
How do I get this to work correctly?
Upvotes: 1
Views: 77
Reputation: 294258
pandas.Series.str.findall
s = df.Event.str.findall('(\d+|\D+)')
pd.DataFrame(dict(
Count=df.Count,
Event_Name=s.str[-1],
Year=s.str[0]
))
Count Event_Name Year
0 1 Green Meeting 2018
1 2 Yellow Meeting 2018
2 3 Red Meeting 2018
3 4 Green Meeting 2017
4 5 Yellow Meeting 2017
5 6 Red Meeting 2017
6 7 Green Meeting 2016
7 8 Yellow Meeting 2016
8 9 Red Meeting 2016
9 10 Blue Meeting Blue Meeting
10 11 Purple Meeting Purple Meeting
11 12 Green Meeting Green Meeting
def f(x):
a, b = x.split(None, 1)
if a.isdecimal():
return a, b
else:
return (x,)
s = df.Event.apply(f)
pd.DataFrame(dict(
Count=df.Count,
Event_Name=s.str[-1],
Year=s.str[0]
))
Count Event_Name Year
0 1 Green Meeting 2018
1 2 Yellow Meeting 2018
2 3 Red Meeting 2018
3 4 Green Meeting 2017
4 5 Yellow Meeting 2017
5 6 Red Meeting 2017
6 7 Green Meeting 2016
7 8 Yellow Meeting 2016
8 9 Red Meeting 2016
9 10 Blue Meeting Blue Meeting
10 11 Purple Meeting Purple Meeting
11 12 Green Meeting Green Meeting
Upvotes: 4
Reputation: 323226
Using str.extract
with fillna
df['Year']=df.Event.str.extract('(\d+)').fillna(df.Event)
then We do replace
df['even_name']=df.Event.str.replace('\d+', '')
Upvotes: 7
Reputation: 636
This should do the work
def get_year(x):
try:
return int(x.split()[0])
except:
return None
def get_event_name(x):
try:
year = int(x.split()[0])
return ' '.join(x.split()[1: ])
except:
return x
df['Year'] = df['Event'].apply(lambda x: get_year(x))
df['Event_Name'] = df['Event'].apply(lambda x: get_event_name(x))
df = df.drop(['Event', ], axis=1)
Upvotes: 0
Reputation: 153460
Use extractall
:
df[['Year','Event']] = df.Event.str.extractall('(\d{4})? ?(.+$)').reset_index('match', drop=True)
Output:
Event Count Year
0 Green Meeting 1 2018
1 Yellow Meeting 2 2018
2 Red Meeting 3 2018
3 Green Meeting 4 2017
4 Yellow Meeting 5 2017
5 Red Meeting 6 2017
6 Green Meeting 7 2016
7 Yellow Meeting 8 2016
8 Red Meeting 9 2016
9 Blue Meeting 10 NaN
10 Purple Meeting 11 NaN
11 Green Meeting 12 NaN
Upvotes: 3