Dumbo
Dumbo

Reputation: 1098

Reshaping table in pandas

I want to reshape a table in pandas . I have a table of the form:

date | country |state | population | num_cars
1    | c1      | s1   | 1          | 1
2    | c1      | s1   | 1          | 1
1    | c1      | s2   | 1          | 1
.
2    | c2      | s2   | 1          | 2
2    | c2      | s2   | 1          | 2

I want to turn it to this shape:

date |1_population | c1_s1_population | c1_s2_population...| c2_s1_populationc1_num_cars |c2_11_num_cars...

To explain, the initial data has pop and numbers by country, state for a date range. Now I want to convert into a number of columns of time series for each level (country, country-state)

How do I do this?

Upvotes: 0

Views: 131

Answers (1)

Valdi_Bo
Valdi_Bo

Reputation: 30971

As the source data sample, I used a DataFrame with 2 hypothetical countries, 3 states each:

    date country state  population  num_cars
0   1990     Xxx   Aaa         100        15
1   2010     Xxx   Aaa         120        18
2   1990     Xxx   Bbb          80         9
3   2010     Xxx   Bbb          88        11
4   1990     Xxx   Ccc          75         6
5   2010     Xxx   Ccc          82         8
6   1990     Yyy   Ggg          40         5
7   2010     Yyy   Ggg          50         6
8   1990     Yyy   Hhh          30         3
9   2010     Yyy   Hhh          38         4
10  1990     Yyy   Jjj          29         3
11  2010     Yyy   Jjj          35         4

To solve your problem, start with defining a reformatting function:

def reformat(grp, col):
    pop = grp[col]
    pop.name = grp.date.iloc[0]
    return pop

From a group of rows (grp) it takes a column of particular name (col), sets the name as date from the first row (the grouping key) and returns it.

As the initial step, group df by country and state:

gr = df.set_index(['country', 'state']).groupby('date')

Then compute 2 DataFrames, as the result of reformatting (applying the above function to each group, for both columns of interest:

df1 = gr.apply(reformat, col='population')
df2 = gr.apply(reformat, col='num_cars')

And having two partial results, merge them on indices:

pd.merge(df1, df2, left_index=True, right_index=True,
    suffixes=('_pop', '_cars'))

The result is:

country Xxx_pop         Yyy_pop         Xxx_cars         Yyy_cars        
state       Aaa Bbb Ccc     Ggg Hhh Jjj      Aaa Bbb Ccc      Ggg Hhh Jjj
date                                                                     
1990        100  80  75      40  30  29       15   9   6        5   3   3
2010        120  88  82      50  38  35       18  11   8        6   4   4

As you can see, the top level of MultiIndex on columns is "Country / population" and "Country / car No". The other level contains state names.

To trace how tis solution works, execute each step separately and inspect its result.

Upvotes: 1

Related Questions