AJG519
AJG519

Reputation: 3379

Reshape Pandas Dataframe

Say I have the following dataframe where I have counts of some variable at 2 different seasons in 2 different years and in 3 different locations. The data is currently structured where each row is a Season / Location combination and there are columns for the count in each year. It looks something like this:

>>> df=pd.DataFrame([['Summer', 'A', 1, 2],
                     ['Winter', 'A', 3, 4],
                     ['Summer', 'B', 5, 6],
                     ['Winter', 'B', 7, 8],
                     ['Summer', 'C', 9, 10],
                     ['Winter', 'C', 11, 12]], 
                    columns=['Season', 'Location', 'Count_2014', 'Count_2015'])
>>> df
   Season Location  Count_2014  Count_2015
0  Summer        A           1           2
1  Winter        A           3           4
2  Summer        B           5           6
3  Winter        B           7           8
4  Summer        C           9          10
5  Winter        C          11          12

I want to restructure the data so that I have a row for each Season, Location, and Year combination (meaning I would have 2 x 3 x 2 = 12 rows). My current approach to doing so is certainly not the most efficient (see below). Any suggestions on the best way to restructure this dataset?

df.set_index(['Season', 'Location'], inplace=True)
ListOfDFs = []
for Year in [x[-4:] for x in df.columns]:
    SubD = df[['Count_' + Year]]
    SubD.columns = ['Count']
    SubD['Year'] = Year
    SubD.set_index('Year', append=True, inplace=True)
    ListOfDFs.append(SubD)

df2=pd.concat(ListOfDFs)
>>> df2
                      Count
Season Location Year       
Summer A        2014      1
Winter A        2014      3
Summer B        2014      5
Winter B        2014      7
Summer C        2014      9
Winter C        2014     11
Summer A        2015      2
Winter A        2015      4
Summer B        2015      6
Winter B        2015      8
Summer C        2015     10
Winter C        2015     12

Upvotes: 1

Views: 273

Answers (2)

AJG519
AJG519

Reputation: 3379

And as another option, it looks like stack() also gets the job done:

>>> df=pd.DataFrame([['Summer','A',1,2],['Winter','A',3,4],['Summer','B',5,6],['Winter','B',7,8],['Summer','C',9,10],['Winter','C',11,12]], columns=['Season','Location','Count_2014','Count_2015'])
>>> 
>>> df.set_index(['Season','Location'], inplace=True)
>>> df.columns=pd.MultiIndex.from_tuples([(col[-4:],col[:-5]) for col in df.columns], names=['Year','Count'])
>>> df=df.stack(level=0)
>>> df
Count                 Count
Season Location Year       
Summer A        2014      1
                2015      2
Winter A        2014      3
                2015      4
Summer B        2014      5
                2015      6
Winter B        2014      7
                2015      8
Summer C        2014      9
                2015     10
Winter C        2014     11
                2015     12
>>> 

Upvotes: 0

chappers
chappers

Reputation: 2415

You are looking for the melt functionality, which will allow you to do this in basically one line:

df_new = pd.melt(df,id_vars=['Season', 'Location'], value_vars=['Count_2014', 'Count_2015'], 
        var_name='Year', 
        value_name='Count')

Then you can use apply (or there probably is something better), to get the output you have above:

df_new['Year'] = df_new['Year'].apply(lambda x: x[-4:])

Output:

    Season Location  Year  Count
0   Summer        A  2014      1
1   Winter        A  2014      3
2   Summer        B  2014      5
3   Winter        B  2014      7
4   Summer        C  2014      9
5   Winter        C  2014     11
6   Summer        A  2015      2
7   Winter        A  2015      4
8   Summer        B  2015      6
9   Winter        B  2015      8
10  Summer        C  2015     10
11  Winter        C  2015     12

Upvotes: 4

Related Questions