Reputation: 511
The df is the dataframe which contain the following information.
In [61]: df.head()
Out[61]:
id movie_id info
0 1 1 Italy:1 January 1994
1 2 2 USA:22 January 2006
2 3 3 USA:12 February 2006
3 4 4 USA:February 2006
4 5 5 USA:2006
I want output like below:
In [61]: df.head()
Out[61]:
id movie_id country Date Month Year
0 1 1 Italy 1 January 1994
1 2 2 USA 22 January 2006
2 3 3 USA 12 February 2006
3 4 4 USA None February 2006
4 5 5 USA None None 2006
The data is stored in dataframe and it must be overwrite into the dataframe.
Upvotes: 1
Views: 96
Reputation: 214967
You can use regex :|\s+
to split the column on either semicolon or white spaces and specify the expand
parameter to be true so that the result will expand to columns:
df[["country","Date","Month","Year"]] = df['info'].str.split(':|\s+', expand = True)
Update:
To handle optional missing dates and months, you could try extract
with regular expression:
(df[["country","Date","Month","Year"]] =
df['info'].str.extract('^([A-Za-z]+):(\d{1,2})? ?([A-Za-z]+)? ?(\d{4})$'))
^([A-Za-z]+):(\d{1,2})? ?([A-Za-z]+)? ?(\d{4})$'
contains four capture groups corresponding to country, Date, Month, Year
respectively;^
and $
denote the start and end of the string;([A-Za-z]+)
captures the country which is before :
and consists of letters; (\d{1,2})
captures Date which consists of one or two digits but optional(with ?
after the group), i.e, could be missing;([A-Za-z]+)
captures Month which consists of letters and it's marked as optional with ?
;(\d{4})
captures the year which consists of four digits;Upvotes: 2
Reputation: 76927
Using split
string method.
In [163]: df[['country', 'date', 'month', 'year']] = df['info'].str.split('\W+', expand=True)
In [164]: df
Out[164]:
id movie_id info country date month year
0 1 1 Italy:1 January 1994 Italy 1 January 1994
1 2 2 USA:22 January 2006 USA 22 January 2006
2 3 3 USA:12 February 2006 USA 12 February 2006
3 4 4 USA:19 February 2006 USA 19 February 2006
4 5 5 USA:22 January 2006 USA 22 January 2006
Upvotes: 1