Reputation: 4200
I have a pandas data frame in which each row (case) holds several sets of related information. In short, I want to reduce the number of columns and make the data frame longer, following pre-specified relationships between the columns.
My old data frame looks like this:
import pandas as pd

old = pd.DataFrame(columns=['index', 'residency', 'rating_NYC', 'dist_NYC', 'rating_PAR', 'dist_PAR',
                            'rating_LON', 'dist_LON', 'rating_MUM', 'dist_MUM', 'gen_rating'],
                   data=[[0, 'New York', 9, 2, 5, 8, 4, 9, 3, 8, 6],
                         [1, 'Paris', 5, 9, 7, 1, 6, 2, 4, 6, 7]])
Each line is one individual stating her residency, rating a city (rating_xxx), stating her geographical distance to that city's centre (dist_xxx), and giving a general rating of living in a city (each range 0-10).
I now want to create a new df with fewer columns and more rows. Each row in the old df yields information for multiple rows in the new one: I want one line per rating_xxx / dist_xxx combination in the old df (i.e. multiple lines per individual). The new df should contain: the old_index, the rating of and distance to that particular city, whether the individual is a resident of that city, and the general rating (gen_rating).
For example, the first line in the new df would contain the first individual's rating of / distance to NYC and the fact that she is an NYC resident (plus her general rating); the second line would contain the first individual's rating of / distance to PAR, etc.
Based on the above data frame, the desired output is:
pd.DataFrame(columns=['index', 'old_index', 'rating', 'dist', 'resident', 'gen_rating'],
data = [ [0, 0, 9, 2, 1, 6], # NYC -> NYC
[1, 0, 5, 8, 0, 6], # NYC -> PAR
[2, 0, 4, 9, 0, 6], # NYC -> LON
[3, 0, 3, 8, 0, 6], # NYC -> MUM
[4, 1, 5, 9, 0, 7], # PAR -> NYC
[5, 1, 7, 1, 1, 7], # PAR -> PAR
[6, 1, 6, 2, 0, 7], # PAR -> LON
[7, 1, 4, 6, 0, 7]])# PAR -> MUM
Can someone point me to the correct function I need for this and the most efficient way of achieving this? (The actual data frame is a bit larger ;) ) Many thanks!
Upvotes: 1
Views: 40
Reputation: 75100
You can first set the columns that hold a single value per individual as the index, then split the remaining column names to create a MultiIndex, and then use stack:
old_ = old.set_index(['index','residency','gen_rating'])
old_.columns = old_.columns.str.split('_',expand=True)
(old_.stack().reset_index(['index','gen_rating']).reset_index(drop=True)
.rename_axis('New_Index'))
           index  gen_rating  dist  rating
New_Index
0              0           6     9       4
1              0           6     8       3
2              0           6     2       9
3              0           6     8       5
4              1           7     2       6
5              1           7     6       4
6              1           7     9       5
7              1           7     1       7
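The rows in the output above are ordered by city code (LON, MUM, NYC, PAR), while the desired output in the question follows the original column order (NYC, PAR, LON, MUM). Row order after stack can also differ between pandas versions. The following is a minimal sketch of one way to pin the order explicitly; the level names measure and city and the variables city_order and long_df are illustrative labels, not part of the answer above.

```python
import pandas as pd

old = pd.DataFrame(
    columns=['index', 'residency', 'rating_NYC', 'dist_NYC', 'rating_PAR', 'dist_PAR',
             'rating_LON', 'dist_LON', 'rating_MUM', 'dist_MUM', 'gen_rating'],
    data=[[0, 'New York', 9, 2, 5, 8, 4, 9, 3, 8, 6],
          [1, 'Paris', 5, 9, 7, 1, 6, 2, 4, 6, 7]])

old_ = old.set_index(['index', 'residency', 'gen_rating'])
# Name the two column levels (illustrative names) so they can be referenced later.
old_.columns = old_.columns.str.split('_', expand=True).set_names(['measure', 'city'])

# City codes in the order they first appear in the wide columns.
city_order = list(dict.fromkeys(old_.columns.get_level_values('city')))

long_df = old_.stack('city').reset_index()

# An ordered categorical makes sort_values follow city_order instead of the alphabet.
long_df['city'] = pd.Categorical(long_df['city'], categories=city_order, ordered=True)
long_df = (long_df.sort_values(['index', 'city'])
                  .reset_index(drop=True)
                  .rename_axis('New_Index'))

print(long_df[['index', 'city', 'rating', 'dist', 'gen_rating']])
```

Sorting on an ordered categorical keeps the approach independent of whichever order stack happens to produce.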
Or, if you want to keep the residency and city labels for later reference, you can retain the stacked index levels:
old_.stack().reset_index(['index','gen_rating'])
               index  gen_rating  dist  rating
residency
New York  LON      0           6     9       4
          MUM      0           6     8       3
          NYC      0           6     2       9
          PAR      0           6     8       5
Paris     LON      1           7     2       6
          MUM      1           7     6       4
          NYC      1           7     9       5
          PAR      1           7     1       7
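With the residency and city labels retained, the resident column from the question can be derived by comparing the two. This is only a sketch under one assumption: the data does not say how residency names relate to city codes, so the residency_to_code lookup below is hand-written and hypothetical (inferred from the comments in the desired output) and would need an entry for every residency value in the real data. The level names measure and city are likewise illustrative.

```python
import pandas as pd

old = pd.DataFrame(
    columns=['index', 'residency', 'rating_NYC', 'dist_NYC', 'rating_PAR', 'dist_PAR',
             'rating_LON', 'dist_LON', 'rating_MUM', 'dist_MUM', 'gen_rating'],
    data=[[0, 'New York', 9, 2, 5, 8, 4, 9, 3, 8, 6],
          [1, 'Paris', 5, 9, 7, 1, 6, 2, 4, 6, 7]])

old_ = old.set_index(['index', 'residency', 'gen_rating'])
old_.columns = old_.columns.str.split('_', expand=True).set_names(['measure', 'city'])

# Keep every index level as a regular column, including residency and city.
long_df = old_.stack('city').reset_index()

# ASSUMPTION: hypothetical lookup from residency name to city code.
residency_to_code = {'New York': 'NYC', 'Paris': 'PAR'}

# 1 where the individual's residency matches the rated city, else 0.
# Residencies missing from the lookup map to NaN and therefore yield 0.
long_df['resident'] = (
    long_df['residency'].map(residency_to_code) == long_df['city']
).astype(int)

new = (long_df[['index', 'rating', 'dist', 'resident', 'gen_rating']]
       .rename(columns={'index': 'old_index'}))
print(new)
```

The rows hold the same values as the desired output, though their order within each individual may differ depending on the pandas version; sort explicitly if the order matters.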
Upvotes: 1