user2755526
user2755526

Reputation: 197

Splitting and unpivoting a column in pandas

I have the following dataframe in pandas read from a .csv file.

    Month  Date       Year   Obs_score   
0   Jan    24,25,26   2015   -1.75   
1   Mar    2          2015   -2.0  
2   Apr    3          2015   -0.5  

I need to unpivot the Date column so that I get the following.

    Month  Date       Year   Obs_score   
0   Jan    24         2015   -1.75
1   Jan    25         2015   -1.75
2   Jan    26         2015   -1.75
3   Mar    2          2015   -2.0  
4   Apr    3          2015   -0.5

Basically make the Date field uniform and have a single value per row and replicate the Obs_scores for the comma separated dates. Is there a simple way to do this? The order of the indices are irrelevant.

Upvotes: 2

Views: 1038

Answers (1)

Acorbe
Acorbe

Reputation: 8391

A possible solution is based on pandas merge method to create databases on the basis of relations.

Case test:

 # (just two columns for ease)
 df = pd.DataFrame({'Date' : ['24,25,26','2','3']
         , 'score' : [-1.75,-2.0,-0.5] })  

i.e.

       Date  score
0  24,25,26  -1.75
1         2  -2.00
2         3  -0.50

1: produce the "split" column (the assumption that the entries are comma separated values in strings)

  b = df['Date'].apply(lambda x : pd.Series(x.split(","))).stack()

 0  0    24
    1    25
    2    26
 1  0     2
 2  0     3

2: reset the indices and merge on the base of the proper labels

 b_reset = b.reset_index()

     level_0  level_1   0
  0        0        0  24
  1        0        1  25
  2        0        2  26
  3        1        0   2
  4        2        0   3

 df_reset = df.reset_index()

    index      Date  score
  0      0  24,25,26  -1.75
  1      1         2  -2.00
  2      2         3  -0.50

It's clear that one has to merge b_reset and df_reset in a relation one-to-many, linking level_0 of b_reset and index of df_reset:

df_temp = pd.merge(b_reset 
           , df_reset
           , left_on = 'level_0'
           , right_on = 'index' )

3: we finally keep just the useful columns

df_t[['score',0]]


   score   0
0  -1.75  24
1  -1.75  25
2  -1.75  26
3  -2.00   2
4  -0.50   3

Upvotes: 1

Related Questions