Reputation: 3713
Given the following data frames:
results = pd.DataFrame({'Contractor':[1,1,0,0,0,1],
'President':[1,0,0,0,1,1],
'Item 1':[1,1,0,0,1,np.nan],
'Item 2':[1,0,0,1,0,1]})
results[['Contractor','President','Item 1','Item 2']]
results
Contractor President Item 1 Item 2
0 1 1 1 1
1 1 0 1 0
2 0 0 0 0
3 0 0 0 1
4 0 1 1 0
5 1 1 NaN 1
and this for reference to items (see below):
Position Item(s)
0 Contractor 1
1 President 1,2
...I would like to pivot the data to produce this:
Position Overall%
0 Contractor 100
1 President 80
...based on this logic:
Because the president is concerned with items 1 and 2, there are 5 numbers to consider: (1 and 1) from item 1 and (1, 0, and 1) from item 2. The sum across items is 4 and the count across items is 5 (do not count 'NaN'), which gives 80%.
Because the contractor is only concerned with item 1, there are 2 numbers to consider: 1 and 1 - 'NaN' should not be counted - (from the rows of interest, respectively). Therefore, the sum is 2 out of the count, which is 2, which gives 100%
Thanks in advance!
Upvotes: 2
Views: 89
Reputation: 879701
import numpy as np
import pandas as pd
results = pd.DataFrame({'Contractor':[1,1,0,0,0,1],
'President':[1,0,0,0,1,1],
'Item 1':[1,1,0,0,1,np.nan],
'Item 2':[1,0,0,1,0,1]})
reference = pd.DataFrame({'Position':['Contractor','President'],
'Item(s)':[(1,), (1,2)]})
longref = pd.DataFrame([('Item {}'.format(item), row['Position'])
for index, row in reference.iterrows()
for item in row['Item(s)']], columns=['Item', 'Position'])
melted = pd.melt(results, id_vars=['Item 1','Item 2'], var_name='Position')
melted = melted.loc[melted['value']==1]
melted = pd.melt(melted, id_vars=['Position'],
value_vars=['Item 1','Item 2'], var_name='Item')
merged = pd.merge(longref, melted, how='left')
grouped = merged.groupby(['Position'])
result = (grouped['value'].sum() / grouped['value'].count())*100
result = result.rename('Overall%').reset_index()
print(result)
yields
Position Overall%
0 Contractor 100.0
1 President 80.0
Explanation: There is an article by Hadley Wickham (PDF) propounding the advantages of making data "tidy". The main tenet is that each row should represent an "observation" and each column represent some factor or variable.
It frequently turns out that the tools you will need to express your calculation will fall into place quite naturally once the data is tidy. The difficulty of this problem largely comes from the data not being tidy.
Consider results
:
In [405]: results
Out[405]:
Contractor Item 1 Item 2 President
0 1 1.0 1 1
1 1 1.0 0 0
2 0 0.0 0 0
3 0 0.0 1 0
4 0 1.0 0 1
5 1 NaN 1 1
Instead of having separate columns for Contractor
and President
, it would be nicer to have one column called Position
, since Position
is the variable, and each observation or row can have one value for Position
-- either a Contractor
or a President
.
Similarly, Item 1
and Item 2
should be coalesced into a single column Item
:
In [416]: melted
Out[416]:
Position Item value
0 Contractor Item 1 1.0
1 Contractor Item 1 1.0
2 Contractor Item 1 NaN
3 President Item 1 1.0
4 President Item 1 1.0
5 President Item 1 NaN
6 Contractor Item 2 1.0
7 Contractor Item 2 0.0
8 Contractor Item 2 1.0
9 President Item 2 1.0
10 President Item 2 0.0
11 President Item 2 1.0
melted
contains the same information as results
, but in a tidy format. The value
column contains the values in results[['Item 1', 'Item 2']]
. Each row corresponds to an "observation" where either results['Contractor']
or result['President']` equals 1, since the calculation's logic only requires these values.
Similarly, instead of
In [407]: reference
Out[407]:
Item(s) Position
0 (1,) Contractor
1 (1, 2) President
it would be tidier to have a DataFrame whose columns are Item
and Position
:
In [408]: longref
Out[408]:
Item Position
0 Item 1 Contractor
1 Item 1 President
2 Item 2 President
Once you have the tidy version of your data in the form of melted
and longref
,
calculating the desired result is fairly straight-forward:
merged = pd.merge(longref, melted, how='left')
# Item Position value
# 0 Item 1 Contractor 1.0
# 1 Item 1 Contractor 1.0
# 2 Item 1 Contractor NaN
# 3 Item 1 President 1.0
# 4 Item 1 President 1.0
# 5 Item 1 President NaN
# 6 Item 2 President 1.0
# 7 Item 2 President 0.0
# 8 Item 2 President 1.0
grouped = merged.groupby(['Position'])
result = (grouped['value'].sum() / grouped['value'].count())*100
result = result.rename('Overall%').reset_index()
How to tidy-up reference
to make longref
:
Just iterate through the rows of reference
and for each row iterate through the tuple of items to build the new DataFrame, longref
:
longref = pd.DataFrame([('Item {}'.format(item), row['Position'])
for index, row in reference.iterrows()
for item in row['Item(s)']], columns=['Item', 'Position'])
How to tidy-up results
to make melted
:
It can be done with two calls to pd.melt
. pd.melt
converts "wide" format to "long" format DataFrames. It can coalesce multiple columns into a single column. For example, to coalesce the Contractor and President columns into a single Position column you could use:
melted = pd.melt(results, id_vars=['Item 1','Item 2'], var_name='Position')
# we only care about rows where Contractor or President value was 1. So use .loc to select those rows.
melted = melted.loc[melted['value']==1]
# Item 1 Item 2 Position value
# 0 1.0 1 Contractor 1
# 1 1.0 0 Contractor 1
# 5 NaN 1 Contractor 1
# 6 1.0 1 President 1
# 10 1.0 0 President 1
# 11 NaN 1 President 1
and similarly, to coalesce the Item 1
and Item 2
columns into a single Item
column, use:
melted = pd.melt(melted, id_vars=['Position'],
value_vars=['Item 1','Item 2'], var_name='Item')
# Position Item value
# 0 Contractor Item 1 1.0
# 1 Contractor Item 1 1.0
# 2 Contractor Item 1 NaN
# 3 President Item 1 1.0
# 4 President Item 1 1.0
# 5 President Item 1 NaN
# 6 Contractor Item 2 1.0
# 7 Contractor Item 2 0.0
# 8 Contractor Item 2 1.0
# 9 President Item 2 1.0
# 10 President Item 2 0.0
# 11 President Item 2 1.0
Upvotes: 1