Dance Party
Dance Party

Reputation: 3713

Pandas Pivot Table with Fancy Stacking

Given the following data frames:

results = pd.DataFrame({'Contractor':[1,1,0,0,0,1],
                    'President':[1,0,0,0,1,1],
                   'Item 1':[1,1,0,0,1,np.nan],
                   'Item 2':[1,0,0,1,0,1]})
results[['Contractor','President','Item 1','Item 2']]

results

    Contractor  President   Item 1  Item 2
0        1          1       1       1
1        1          0       1       0
2        0          0       0       0
3        0          0       0       1
4        0          1       1       0
5        1          1       NaN     1

and this for reference to items (see below):

    Position    Item(s)
0   Contractor  1
1   President   1,2

...I would like to pivot the data to produce this:

    Position    Overall%
0   Contractor  100
1   President   80

...based on this logic:

Because the president is concerned with items 1 and 2, there are 5 numbers to consider: (1 and 1) from item 1 and (1, 0, and 1) from item 2. The sum across items is 4 and the count across items is 5 (do not count 'NaN'), which gives 80%.

Because the contractor is only concerned with item 1, there are 2 numbers to consider: 1 and 1 - 'NaN' should not be counted - (from the rows of interest, respectively). Therefore, the sum is 2 out of the count, which is 2, which gives 100%

Thanks in advance!

Upvotes: 2

Views: 89

Answers (1)

unutbu
unutbu

Reputation: 879701

import numpy as np
import pandas as pd

results = pd.DataFrame({'Contractor':[1,1,0,0,0,1],
                    'President':[1,0,0,0,1,1],
                   'Item 1':[1,1,0,0,1,np.nan],
                   'Item 2':[1,0,0,1,0,1]})
reference =  pd.DataFrame({'Position':['Contractor','President'],
                           'Item(s)':[(1,), (1,2)]})

longref = pd.DataFrame([('Item {}'.format(item), row['Position']) 
                        for index, row in reference.iterrows() 
                        for item in row['Item(s)']], columns=['Item', 'Position'])
melted = pd.melt(results, id_vars=['Item 1','Item 2'], var_name='Position')
melted = melted.loc[melted['value']==1]
melted = pd.melt(melted, id_vars=['Position'], 
                 value_vars=['Item 1','Item 2'], var_name='Item')
merged = pd.merge(longref, melted, how='left')
grouped = merged.groupby(['Position'])
result = (grouped['value'].sum() / grouped['value'].count())*100
result = result.rename('Overall%').reset_index()
print(result)

yields

     Position  Overall%
0  Contractor     100.0
1   President      80.0

Explanation: There is an article by Hadley Wickham (PDF) propounding the advantages of making data "tidy". The main tenet is that each row should represent an "observation" and each column represent some factor or variable.

It frequently turns out that the tools you will need to express your calculation will fall into place quite naturally once the data is tidy. The difficulty of this problem largely comes from the data not being tidy.

Consider results:

In [405]: results
Out[405]: 
   Contractor  Item 1  Item 2  President
0           1     1.0       1          1
1           1     1.0       0          0
2           0     0.0       0          0
3           0     0.0       1          0
4           0     1.0       0          1
5           1     NaN       1          1

Instead of having separate columns for Contractor and President, it would be nicer to have one column called Position, since Position is the variable, and each observation or row can have one value for Position -- either a Contractor or a President. Similarly, Item 1 and Item 2 should be coalesced into a single column Item:

In [416]: melted
Out[416]: 
      Position    Item  value
0   Contractor  Item 1    1.0
1   Contractor  Item 1    1.0
2   Contractor  Item 1    NaN
3    President  Item 1    1.0
4    President  Item 1    1.0
5    President  Item 1    NaN
6   Contractor  Item 2    1.0
7   Contractor  Item 2    0.0
8   Contractor  Item 2    1.0
9    President  Item 2    1.0
10   President  Item 2    0.0
11   President  Item 2    1.0

melted contains the same information as results, but in a tidy format. The value column contains the values in results[['Item 1', 'Item 2']]. Each row corresponds to an "observation" where either results['Contractor'] or result['President']` equals 1, since the calculation's logic only requires these values.

Similarly, instead of

In [407]: reference
Out[407]: 
  Item(s)    Position
0    (1,)  Contractor
1  (1, 2)   President

it would be tidier to have a DataFrame whose columns are Item and Position:

In [408]: longref
Out[408]: 
     Item    Position
0  Item 1  Contractor
1  Item 1   President
2  Item 2   President

Once you have the tidy version of your data in the form of melted and longref, calculating the desired result is fairly straight-forward:

merged = pd.merge(longref, melted, how='left')
#      Item    Position  value
# 0  Item 1  Contractor    1.0
# 1  Item 1  Contractor    1.0
# 2  Item 1  Contractor    NaN
# 3  Item 1   President    1.0
# 4  Item 1   President    1.0
# 5  Item 1   President    NaN
# 6  Item 2   President    1.0
# 7  Item 2   President    0.0
# 8  Item 2   President    1.0

grouped = merged.groupby(['Position'])
result = (grouped['value'].sum() / grouped['value'].count())*100
result = result.rename('Overall%').reset_index()

How to tidy-up reference to make longref:

Just iterate through the rows of reference and for each row iterate through the tuple of items to build the new DataFrame, longref:

longref = pd.DataFrame([('Item {}'.format(item), row['Position']) 
                        for index, row in reference.iterrows() 
                        for item in row['Item(s)']], columns=['Item', 'Position'])

How to tidy-up results to make melted:

It can be done with two calls to pd.melt. pd.melt converts "wide" format to "long" format DataFrames. It can coalesce multiple columns into a single column. For example, to coalesce the Contractor and President columns into a single Position column you could use:

melted = pd.melt(results, id_vars=['Item 1','Item 2'], var_name='Position')
# we only care about rows where Contractor or President value was 1. So use .loc to select those rows.
melted = melted.loc[melted['value']==1]
#     Item 1  Item 2    Position  value
# 0      1.0       1  Contractor      1
# 1      1.0       0  Contractor      1
# 5      NaN       1  Contractor      1
# 6      1.0       1   President      1
# 10     1.0       0   President      1
# 11     NaN       1   President      1

and similarly, to coalesce the Item 1 and Item 2 columns into a single Item column, use:

melted = pd.melt(melted, id_vars=['Position'], 
                 value_vars=['Item 1','Item 2'], var_name='Item')
#       Position    Item  value
# 0   Contractor  Item 1    1.0
# 1   Contractor  Item 1    1.0
# 2   Contractor  Item 1    NaN
# 3    President  Item 1    1.0
# 4    President  Item 1    1.0
# 5    President  Item 1    NaN
# 6   Contractor  Item 2    1.0
# 7   Contractor  Item 2    0.0
# 8   Contractor  Item 2    1.0
# 9    President  Item 2    1.0
# 10   President  Item 2    0.0
# 11   President  Item 2    1.0

Upvotes: 1

Related Questions