Reputation: 77
I'm stuck with the following situation: I've got statistical data covering multiple years, each year being retrieved from a different file read into a dataframe.
So basically I have multiple dataframes, one per year, each looking like the follinwing:
cat geo1 geo2 geo3 0 A 0.709238 0.669532 -0.465389 1 B -1.426102 -0.152918 0.700080 2 C -0.486294 -1.619334 -1.711047 3 D 0.392837 -0.754785 -1.686076 4 E 0.603256 0.997562 -0.534222
With 'cat' being a product category and geo1 through geo3 being geographic regions
What I would like to end up with is the following:
year1 year2 year(n) cat region A geo1 geo2 geo3 B geo1 geo2 geo3
However, I can't seem to get that done.
Any hint how to go about this is very much appreciated! Best, Joe
Upvotes: 0
Views: 39
Reputation: 3138
If you load multiple years in one table (possibly concatenate tables for each year, with an identifier variable yearX
):
In []: df
Out[]:
cat geo1 geo2 geo3 year
0 A 0.709238 0.669532 -0.465389 year1
1 B -1.426102 -0.152918 0.700080 year1
2 C -0.486294 -1.619334 -1.711047 year1
3 D 0.392837 -0.754785 -1.686076 year1
4 E 0.603256 0.997562 -0.534222 year1
5 A 0.709238 0.669532 -0.465389 year2
6 B -1.426102 -0.152918 0.700080 year2
7 C -0.486294 -1.619334 -1.711047 year2
8 D 0.392837 -0.754785 -1.686076 year2
9 E 0.603256 0.997562 -0.534222 year2
Use melt()
to tidy your data:
In []: df = df.melt(value_vars=[c for c in df.columns if 'geo' in c],
var_name='region',
id_vars=['cat', 'year'])
In []: df
Out[]:
cat year region value
0 A year1 geo1 0.709238
1 B year1 geo1 -1.426102
2 C year1 geo1 -0.486294
3 D year1 geo1 0.392837
4 E year1 geo1 0.603256
5 A year2 geo1 0.709238
6 B year2 geo1 -1.426102
7 C year2 geo1 -0.486294
8 D year2 geo1 0.392837
9 E year2 geo1 0.603256
10 A year1 geo2 0.669532
11 B year1 geo2 -0.152918
12 C year1 geo2 -1.619334
13 D year1 geo2 -0.754785
14 E year1 geo2 0.997562
15 A year2 geo2 0.669532
16 B year2 geo2 -0.152918
17 C year2 geo2 -1.619334
18 D year2 geo2 -0.754785
19 E year2 geo2 0.997562
20 A year1 geo3 -0.465389
21 B year1 geo3 0.700080
22 C year1 geo3 -1.711047
23 D year1 geo3 -1.686076
24 E year1 geo3 -0.534222
25 A year2 geo3 -0.465389
26 B year2 geo3 0.700080
27 C year2 geo3 -1.711047
28 D year2 geo3 -1.686076
29 E year2 geo3 -0.534222
Then pivot:
In []: df.pivot_table(columns='year', index=['cat', 'region'])
Out[]:
value
year year1 year2
cat region
A geo1 0.709238 0.709238
geo2 0.669532 0.669532
geo3 -0.465389 -0.465389
B geo1 -1.426102 -1.426102
geo2 -0.152918 -0.152918
geo3 0.700080 0.700080
C geo1 -0.486294 -0.486294
geo2 -1.619334 -1.619334
geo3 -1.711047 -1.711047
D geo1 0.392837 0.392837
geo2 -0.754785 -0.754785
geo3 -1.686076 -1.686076
E geo1 0.603256 0.603256
geo2 0.997562 0.997562
geo3 -0.534222 -0.534222
Upvotes: 1