Reputation: 3379
Say I have the following dataframe where I have counts of some variable at 2 different seasons in 2 different years and in 3 different locations. The data is currently structured where each row is a Season / Location combination and there are columns for the count in each year. It looks something like this:
>>> df=pd.DataFrame([['Summer', 'A', 1, 2],
['Winter', 'A', 3, 4],
['Summer', 'B', 5, 6],
['Winter', 'B', 7, 8],
['Summer', 'C', 9, 10],
['Winter', 'C', 11, 12]],
columns=['Season', 'Location', 'Count_2014', 'Count_2015'])
>>> df
Season Location Count_2014 Count_2015
0 Summer A 1 2
1 Winter A 3 4
2 Summer B 5 6
3 Winter B 7 8
4 Summer C 9 10
5 Winter C 11 12
I want to restructure the data so that I have a row for each Season, Location, and Year combination (meaning I would have 2 x 3 x 2 = 12 rows). My current approach to doing so is certainly not the most efficient (see below). Any suggestions on the best way to restructure this dataset?
df.set_index(['Season', 'Location'], inplace=True)
ListOfDFs = []
for Year in [x[-4:] for x in df.columns]:
SubD = df[['Count_' + Year]]
SubD.columns = ['Count']
SubD['Year'] = Year
SubD.set_index('Year', append=True, inplace=True)
ListOfDFs.append(SubD)
df2=pd.concat(ListOfDFs)
>>> df2
Count
Season Location Year
Summer A 2014 1
Winter A 2014 3
Summer B 2014 5
Winter B 2014 7
Summer C 2014 9
Winter C 2014 11
Summer A 2015 2
Winter A 2015 4
Summer B 2015 6
Winter B 2015 8
Summer C 2015 10
Winter C 2015 12
Upvotes: 1
Views: 273
Reputation: 3379
And as another option, it looks like stack() also gets the job done:
>>> df=pd.DataFrame([['Summer','A',1,2],['Winter','A',3,4],['Summer','B',5,6],['Winter','B',7,8],['Summer','C',9,10],['Winter','C',11,12]], columns=['Season','Location','Count_2014','Count_2015'])
>>>
>>> df.set_index(['Season','Location'], inplace=True)
>>> df.columns=pd.MultiIndex.from_tuples([(col[-4:],col[:-5]) for col in df.columns], names=['Year','Count'])
>>> df=df.stack(level=0)
>>> df
Count Count
Season Location Year
Summer A 2014 1
2015 2
Winter A 2014 3
2015 4
Summer B 2014 5
2015 6
Winter B 2014 7
2015 8
Summer C 2014 9
2015 10
Winter C 2014 11
2015 12
>>>
Upvotes: 0
Reputation: 2415
You are looking for the melt functionality, which will allow you to do this in basically one line:
df_new = pd.melt(df,id_vars=['Season', 'Location'], value_vars=['Count_2014', 'Count_2015'],
var_name='Year',
value_name='Count')
Then you can use apply
(or there probably is something better), to get the output you have above:
df_new['Year'] = df_new['Year'].apply(lambda x: x[-4:])
Output:
Season Location Year Count
0 Summer A 2014 1
1 Winter A 2014 3
2 Summer B 2014 5
3 Winter B 2014 7
4 Summer C 2014 9
5 Winter C 2014 11
6 Summer A 2015 2
7 Winter A 2015 4
8 Summer B 2015 6
9 Winter B 2015 8
10 Summer C 2015 10
11 Winter C 2015 12
Upvotes: 4