Nishant Singh
Nishant Singh

Reputation: 511

How to split a column data into other columns which is stored in a dataframe?

The df is the dataframe which contain the following information.

 In [61]: df.head()
    Out[61]: 
       id  movie_id                  info
    0   1         1   Italy:1 January 1994
    1   2         2   USA:22 January 2006
    2   3         3   USA:12 February 2006
    3   4         4   USA:February 2006
    4   5         5   USA:2006

I want output like below:

In [61]: df.head()    
Out[61]: 
   id  movie_id    country Date    Month   Year
0   1         1    Italy    1     January  1994
1   2         2    USA      22    January  2006
2   3         3    USA      12    February 2006
3   4         4    USA      None  February 2006
4   5         5    USA      None  None     2006

The data is stored in dataframe and it must be overwrite into the dataframe.

Upvotes: 1

Views: 96

Answers (2)

akuiper
akuiper

Reputation: 214967

You can use regex :|\s+ to split the column on either semicolon or white spaces and specify the expand parameter to be true so that the result will expand to columns:

df[["country","Date","Month","Year"]] = df['info'].str.split(':|\s+', expand = True)

enter image description here

Update:

To handle optional missing dates and months, you could try extract with regular expression:

(df[["country","Date","Month","Year"]] = 
     df['info'].str.extract('^([A-Za-z]+):(\d{1,2})? ?([A-Za-z]+)? ?(\d{4})$'))
  • ^([A-Za-z]+):(\d{1,2})? ?([A-Za-z]+)? ?(\d{4})$' contains four capture groups corresponding to country, Date, Month, Year respectively;
  • ^ and $ denote the start and end of the string;
  • ([A-Za-z]+) captures the country which is before : and consists of letters;
  • (\d{1,2}) captures Date which consists of one or two digits but optional(with ? after the group), i.e, could be missing;
  • ([A-Za-z]+) captures Month which consists of letters and it's marked as optional with ?;
  • (\d{4}) captures the year which consists of four digits;

enter image description here

Upvotes: 2

Zero
Zero

Reputation: 76927

Using split string method.

In [163]: df[['country', 'date', 'month', 'year']] = df['info'].str.split('\W+', expand=True)

In [164]: df
Out[164]:
   id  movie_id                  info country date     month  year
0   1         1  Italy:1 January 1994   Italy    1   January  1994
1   2         2   USA:22 January 2006     USA   22   January  2006
2   3         3  USA:12 February 2006     USA   12  February  2006
3   4         4  USA:19 February 2006     USA   19  February  2006
4   5         5   USA:22 January 2006     USA   22   January  2006

Upvotes: 1

Related Questions