Reputation: 7536
Given the following:
import pandas as pd
import numpy as np
df=pd.DataFrame({'County':['A','B','A','B','A','B','A','B','A','B'],
'Hospital':['a','b','e','f','i','j','m','n','b','r'],
'Enrollment':[44,55,95,54,81,54,89,76,1,67],
'Year':['2012','2012','2012','2012','2012','2013',
'2013','2013','2013','2013']})
d2=pd.pivot_table(df,index=['County','Hospital'],columns=['Year'])#.sort_columns
d2
Enrollment
Year 2012 2013
County Hospital
A a 44.0 NaN
b NaN 1.0
e 95.0 NaN
i 81.0 NaN
m NaN 89.0
B b 55.0 NaN
f 54.0 NaN
j NaN 54.0
n NaN 76.0
r NaN 67.0
If a hospital such as 'b' exists more than once and it has no data for the previous year (the first occurrence of 'b'), I'd like to assign the previous year value of Enrollment for the other row ('b') and delete the row of 'b' that does not contain data for the first year like this:
Enrollment
Year 2012 2013
County Hospital
A a 44.0 NaN
b 55.0 1.0
e 95.0 NaN
i 81.0 NaN
m NaN 89.0
B f 54.0 NaN
j NaN 54.0
n NaN 76.0
r NaN 67.0
So far, I can identify the duplicate row and delete, but I'm just stuck on replacing NaN with a value where needed:
Identify duplicate hospitals after resetting the index:
d2=d2.reset_index()
d2['dup']=d2.duplicated('Hospital',keep=False)
Flag, for deletion, the duplicate hospital with no data for the most recent year:
Hospital=d2.columns.levels[0][1]
Y1=d2.columns.levels[1][0]
Y2=d2.columns.levels[1][1]
d2['Delete']=np.nan
d2.loc[(pd.isnull(d2.Enrollment[Y2]))&(d2['dup']==True),'Delete']='Yes'
Keep all but rows to delete:
d2=d2.loc[d2['Delete']!='Yes']
Upvotes: 3
Views: 957
Reputation: 879739
If I understand correctly, the problem is to copy values from County B into County A when the Hospital matches. That can be done with groupby/fillna(method='bfill')
. The bfill
method backfills NaNs with the closest succeeding non-NaN value.
Then, you can use d2.drop_duplicates(subset=['Hospital'], keep='first')
to keep the first row when the Hospital matches.
For example,
import pandas as pd
df = pd.DataFrame({'County': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
'Hospital': ['a', 'b', 'e', 'f', 'i', 'j', 'm', 'n', 'b', 'r'],
'Enrollment': [44, 55, 95, 54, 81, 54, 89, 76, 1, 67],
'Year': ['2012', '2012', '2012', '2012', '2012', '2013',
'2013', '2013', '2013', '2013']})
d2 = pd.pivot_table(df, index=['County', 'Hospital'], columns=['Year'])
d2 = d2.groupby(level='Hospital').fillna(method='bfill')
d2 = d2.reset_index()
d2 = d2.drop_duplicates(subset=['Hospital'], keep='first')
yields
County Hospital Enrollment
Year 2012 2013
0 A a 44.0 NaN
1 A b 55.0 1.0
2 A e 95.0 NaN
3 A i 81.0 NaN
4 A m NaN 89.0
6 B f 54.0 NaN
7 B j NaN 54.0
8 B n NaN 76.0
9 B r NaN 67.0
Upvotes: 3
Reputation: 294338
manipulate d2
to put A
and B
side by side.
e = d2.unstack(0).swaplevel(1, 2, 1).sort_index(1).Enrollment
print e
County A B
Year 2012 2013 2012 2013
Hospital
a 44.0 NaN NaN NaN
b NaN 1.0 55.0 NaN
e 95.0 NaN NaN NaN
f NaN NaN 54.0 NaN
i 81.0 NaN NaN NaN
j NaN NaN NaN 54.0
m NaN 89.0 NaN NaN
n NaN NaN NaN 76.0
r NaN NaN NaN 67.0
create an apply function to assign values from B
and nullify B
afterwards.
def manipulate_rows(row):
if pd.notnull(row.loc['A'].iloc[1]) & pd.isnull(row.loc['A'].iloc[0]):
row.A = row.A.combine_first(row.B)
row.B = np.nan
return row
d3 = e.apply(manipulate_rows, axis=1).stack(0).swaplevel(0, 1).sort_index()
Stacking will naturally drop the missing values after the pivot.
Reassign the d2
columns
d3.columns = d2.columns
print d3
Enrollment
Year 2012 2013
County Hospital
A a 44.0 NaN
b 55.0 1.0
e 95.0 NaN
i 81.0 NaN
m NaN 89.0
B f 54.0 NaN
j NaN 54.0
n NaN 76.0
r NaN 67.0
Upvotes: 1