Ivo

Reputation: 4200

Transform pd.DataFrame() to narrower, longer dataframe

I have a pandas data frame in which each row (one case) contains several sets of related information. In short, I want fewer columns and more rows, reshaped according to pre-specified relationships between the columns.

My old data frame looks like this:

import pandas as pd

old = pd.DataFrame(columns=['index', 'residency', 'rating_NYC', 'dist_NYC', 'rating_PAR', 'dist_PAR',
                            'rating_LON', 'dist_LON', 'rating_MUM', 'dist_MUM', 'gen_rating'],
                   data=[[0, 'New York', 9, 2,     5, 8,   4, 9,  3, 8,  6],
                         [1, 'Paris',    5, 9,     7, 1,   6, 2,  4, 6,  7]])

Each row is one individual stating her residency, rating each city (rating_xxx), stating her geographical distance to that city's centre (dist_xxx), and giving a general rating of living in a city (gen_rating). All ratings and distances are on a 0-10 scale.

I now want to create a new df with fewer columns and more rows. Each row in the old df yields information for multiple rows in the new one: I want one row per rating_xxx / dist_xxx pair in the old df (i.e. multiple rows per individual). The new df should contain: (i) the old index (old_index), (ii) the rating of and (iii) the distance to that particular city, (iv) whether the individual is a resident of that city (resident), and (v) the general rating (gen_rating).

For example, the first row in the new df would contain the first individual's rating of / distance to NYC, the fact that she is an NYC resident, and her general rating; the second row would contain the first individual's rating of / distance to PAR, and so on.

Based on the above data frame, the desired output is:

pd.DataFrame(columns=['index', 'old_index', 'rating', 'dist', 'resident', 'gen_rating'],
             data = [      [0,           0,        9,      2,          1,            6], # NYC -> NYC
                           [1,           0,        5,      8,          0,            6], # NYC -> PAR
                           [2,           0,        4,      9,          0,            6], # NYC -> LON
                           [3,           0,        3,      8,          0,            6], # NYC -> MUM
                           [4,           1,        5,      9,          0,            7], # PAR -> NYC
                           [5,           1,        7,      1,          1,            7], # PAR -> PAR
                           [6,           1,        6,      2,          0,            7], # PAR -> LON
                           [7,           1,        4,      6,          0,            7]])# PAR -> MUM

Can someone point me to the correct function I need for this and the most efficient way of achieving this? (The actual data frame is a bit larger ;) ) Many thanks!

Upvotes: 1

Views: 40

Answers (1)

anky

Reputation: 75100

You can first set the columns that hold a single value per individual as the index, then split the remaining column names on '_' to create a MultiIndex, and then use stack to move the city level into the rows:

old_ = old.set_index(['index','residency','gen_rating'])
old_.columns = old_.columns.str.split('_',expand=True)

(old_.stack()
     .reset_index(['index', 'gen_rating'])
     .reset_index(drop=True)
     .rename_axis('New_Index'))

           index  gen_rating  dist  rating
New_Index                                 
0              0           6     9       4
1              0           6     8       3
2              0           6     2       9
3              0           6     8       5
4              1           7     2       6
5              1           7     6       4
6              1           7     9       5
7              1           7     1       7
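
To see what the column-splitting step does on its own, here is a minimal sketch on a shortened list of column names (the variable names cols and multi are only for illustration). Splitting with expand=True turns each name into a (measure, city) pair, which is what lets stack move the city level into the rows:

```python
import pandas as pd

# A shortened set of the wide column names from the question
cols = pd.Index(['rating_NYC', 'dist_NYC', 'rating_PAR', 'dist_PAR'])

# expand=True returns a two-level MultiIndex instead of an Index of lists
multi = cols.str.split('_', expand=True)

print(multi.nlevels)
print(multi.tolist())
```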

Or, if you want to keep the residency and city labels for later reference, you can retain the stacked index levels:

old_.stack().reset_index(['index','gen_rating'])

               index  gen_rating  dist  rating
residency                                     
New York  LON      0           6     9       4
          MUM      0           6     8       3
          NYC      0           6     2       9
          PAR      0           6     8       5
Paris     LON      1           7     2       6
          MUM      1           7     6       4
          NYC      1           7     9       5
          PAR      1           7     1       7
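
The outputs above do not yet include the resident flag from the question, and the cities come out in alphabetical rather than original column order. One possible way to get closer to the desired frame is pd.wide_to_long followed by an explicit sort. This is only a sketch: the city_codes mapping from residency names to city suffixes is an assumption (the question only shows 'New York' and 'Paris'), and the helper column city_pos is introduced purely for ordering.

```python
import pandas as pd

old = pd.DataFrame(
    columns=['index', 'residency', 'rating_NYC', 'dist_NYC', 'rating_PAR', 'dist_PAR',
             'rating_LON', 'dist_LON', 'rating_MUM', 'dist_MUM', 'gen_rating'],
    data=[[0, 'New York', 9, 2, 5, 8, 4, 9, 3, 8, 6],
          [1, 'Paris',    5, 9, 7, 1, 6, 2, 4, 6, 7]])

# Assumed mapping from residency names to the city suffixes in the column names
city_codes = {'New York': 'NYC', 'Paris': 'PAR', 'London': 'LON', 'Mumbai': 'MUM'}

# City order as it appears in the wide columns: NYC, PAR, LON, MUM
cities = [c.split('_', 1)[1] for c in old.columns if c.startswith('rating_')]

# One row per (individual, city); rating_* and dist_* collapse into 'rating' and 'dist'
long_df = pd.wide_to_long(old, stubnames=['rating', 'dist'],
                          i='index', j='city', sep='_', suffix=r'\w+').reset_index()

# 1 if the individual lives in the city of that row, else 0
long_df['resident'] = (long_df['residency'].map(city_codes) == long_df['city']).astype(int)

# Sort explicitly so the result does not depend on wide_to_long's row order
long_df['city_pos'] = long_df['city'].map({c: k for k, c in enumerate(cities)})
new = (long_df.sort_values(['index', 'city_pos'])
              .rename(columns={'index': 'old_index'})
              [['old_index', 'rating', 'dist', 'resident', 'gen_rating']]
              .reset_index(drop=True))

print(new)
```

The same resident comparison could also be applied to the stacked frame from the answer, after resetting the residency and city index levels into columns.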

Upvotes: 1
