How to split a column data into other columns which is stored in a dataframe?

Question

The df is the dataframe which contain the following information.

 In [61]: df.head()
    Out[61]: 
       id  movie_id                  info
    0   1         1   Italy:1 January 1994
    1   2         2   USA:22 January 2006
    2   3         3   USA:12 February 2006
    3   4         4   USA:February 2006
    4   5         5   USA:2006

I want output like below:

In [61]: df.head()    
Out[61]: 
   id  movie_id    country Date    Month   Year
0   1         1    Italy    1     January  1994
1   2         2    USA      22    January  2006
2   3         3    USA      12    February 2006
3   4         4    USA      None  February 2006
4   5         5    USA      None  None     2006

The data is stored in dataframe and it must be overwrite into the dataframe.

akuiper · Accepted Answer

You can use regex :|\s+ to split the column on either semicolon or white spaces and specify the expand parameter to be true so that the result will expand to columns:

df[["country","Date","Month","Year"]] = df['info'].str.split(':|\s+', expand = True)

Update:

To handle optional missing dates and months, you could try extract with regular expression:

(df[["country","Date","Month","Year"]] = 
     df['info'].str.extract('^([A-Za-z]+):(\d{1,2})? ?([A-Za-z]+)? ?(\d{4})$'))

^([A-Za-z]+):(\d{1,2})? ?([A-Za-z]+)? ?(\d{4})$' contains four capture groups corresponding to country, Date, Month, Year respectively;
^ and $ denote the start and end of the string;
([A-Za-z]+) captures the country which is before : and consists of letters;
(\d{1,2}) captures Date which consists of one or two digits but optional(with ? after the group), i.e, could be missing;
([A-Za-z]+) captures Month which consists of letters and it's marked as optional with ?;
(\d{4}) captures the year which consists of four digits;

How to split a column data into other columns which is stored in a dataframe?

Answers (2)

Related Questions