Reputation: 1109
I have two pieces of data that I want to merge. df1 is a pandas dataframe containing a list of contracts, where year is the year the contract was executed and o_id is the id of the organization the contract is from. df2 is a pivot table of each organization's problems over the years (where year is the year in which the audit checking for organization problems was completed). P_1 and P_2 refer to problem 1 and problem 2.
df1
c_id | o_id | year |
====================
101 | 10 | 2013 |
102 | 10 | 2014 |
103 | 10 | 2015 |
103 | 10 | 2016 |
121 | 12 | 2013 |
122 | 12 | 2014 |
123 | 12 | 2015 |
123 | 12 | 2016 |
df2
       |         P_1        |         P_2        |
year   | 2013 | 2014 | 2015 | 2013 | 2014 | 2015 |
id     |
==================================================
10     |    1 |    0 |    0 |    0 |    0 |    0 |
12     |    0 |    1 |    0 |    1 |    1 |    0 |
The aim is to merge these two data sets in order to capture the 'history' of problems for each contract relative to the year in which that contract was executed (merging on df1['o_id'] = df2['id']).
Note that I cannot include history for the year in which the contract was executed (e.g. a 2015 contract can only use history from 2014 and before).
I'm looking to make the final output look like this:
id | year | 2013_P_1 | 2014_P_1 | 2015_P_1 | 2013_P_2 | 2014_P_2 | 2015_P_2
===============================================================================
10 | 2013 | NA | NA | NA | NA | NA | NA
10 | 2014 | 1 | NA | NA | 0 | NA | NA
10 | 2015 | 1 | 0 | NA | 0 | 0 | NA
10 | 2016 | 1 | 0 | 0 | 0 | 0 | 0
12 | 2013 | NA | NA | NA | NA | NA | NA
12 | 2014 | 0 | NA | NA | 1 | NA | NA
12 | 2015 | 0 | 1 | NA | 1 | 1 | NA
12 | 2016 | 0 | 1 | 0 | 1 | 1 | 0
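For reference, the two inputs can be rebuilt from the tables above with something like this (df2's columns are a two-level index of problem and year; adjust if your real pivot is constructed differently):

import pandas as pd

df1 = pd.DataFrame({'c_id': [101, 102, 103, 103, 121, 122, 123, 123],
                    'o_id': [10, 10, 10, 10, 12, 12, 12, 12],
                    'year': [2013, 2014, 2015, 2016, 2013, 2014, 2015, 2016]})

# df2: organization id as index, (problem, year) as a two-level column index
cols = pd.MultiIndex.from_product([['P_1', 'P_2'], [2013, 2014, 2015]],
                                  names=[None, 'year'])
df2 = pd.DataFrame([[1, 0, 0, 0, 0, 0],
                    [0, 1, 0, 1, 1, 0]],
                   index=pd.Index([10, 12], name='id'), columns=cols)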
Upvotes: 2
Views: 5946
Reputation: 863166
First reshape df2 with stack and join it to df1, then replace values with NaNs using a custom function:
import numpy as np
import pandas as pd

# reshape df2: stack the problem level into rows (it becomes the column level_1),
# then join the yearly history onto each contract by organization id
df = (df1.drop('c_id', axis=1)
         .join(df2.stack(0).reset_index(level=1), on='o_id')
         .set_index(['o_id', 'year', 'level_1']))

# blank out the contract year and every later year within each (o_id, problem) block
def f(x):
    il1 = np.triu_indices(len(x.columns))
    a = x.values.astype(float)
    a[il1] = np.nan
    return pd.DataFrame(a, columns=x.columns, index=x.index)

df = df.groupby(['o_id', 'level_1']).apply(f).unstack().sort_index(axis=1, level=1)
df.columns = ['{}_{}'.format(a, b) for a, b in df.columns]
df = df.reset_index()
print(df)
o_id year 2013_P_1 2014_P_1 2015_P_1 2013_P_2 2014_P_2 2015_P_2
0 10 2013 NaN NaN NaN NaN NaN NaN
1 10 2014 1.0 NaN NaN 0.0 NaN NaN
2 10 2015 1.0 0.0 NaN 0.0 0.0 NaN
3 10 2016 1.0 0.0 0.0 0.0 0.0 0.0
4 12 2013 NaN NaN NaN NaN NaN NaN
5 12 2014 0.0 NaN NaN 1.0 NaN NaN
6 12 2015 0.0 1.0 NaN 1.0 1.0 NaN
7 12 2016 0.0 1.0 0.0 1.0 1.0 0.0
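The masking in f works because each (o_id, problem) block has the contract years 2013-2016 as rows and the history years 2013-2015 as columns, so np.triu_indices marks every cell whose column year is the contract year or later. A small sketch of the mask applied to a block of ones:

import numpy as np

a = np.ones((4, 3))                # 4 contract years x 3 history years
rows, cols = np.triu_indices(3)    # upper triangle (incl. diagonal) of a 3x3 block
a[rows, cols] = np.nan             # 2013 loses all, 2014 keeps 2013, 2015 keeps 2013-2014
print(a)
# [[nan nan nan]
#  [ 1. nan nan]
#  [ 1.  1. nan]
#  [ 1.  1.  1.]]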
Upvotes: 2
Reputation: 107687
Assuming you can adjust your pivot_table specification, consider pivoting on a duplicate year2 column while keeping year in the index. Then rename the pivot columns and run a left-join merge:
from itertools import product
...
# NEW PIVOT (ADJUST ACCORDINGLY)
origdf['year2'] = origdf['year']
df2 = origdf.pivot_table(index=['ID', 'year'], columns=['problem'],
                         values=['year2', 'value'], aggfunc='max')

# RETURN CARTESIAN PRODUCT BETWEEN BOTH PIVOT COLUMN LEVELS
newcols = [str(i[1]) + '_' + i[0]
           for i in product(df2.columns.levels[0], df2.columns.levels[1])]

# FLATTEN HIERARCHICAL COLUMNS
df2.columns = df2.columns.get_level_values(0)
# REASSIGN COLUMNS
df2.columns = newcols
# RESET MULTI-INDEX BACK AS DF COLUMNS
df2 = df2.reset_index()

# MERGE, DROP UNNEEDED COLS, RE-ORDER COLUMNS
finaldf = df1.merge(df2, left_on=['o_id'], right_on=['id'], how='left')\
             .drop(columns=['c_id', 'o_id'])[['id', 'year'] + newcols]
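This assumes the problem history is still available in long (un-pivoted) form before building the pivot; origdf, ID, problem and value above are placeholder names for whatever that source table uses. On the sample data in the question, such a long table would look roughly like:

import pandas as pd

# hypothetical long-format audit table (values taken from the df2 shown above)
origdf = pd.DataFrame({'ID':      [10, 10, 10, 10, 10, 10, 12, 12, 12, 12, 12, 12],
                       'year':    [2013, 2014, 2015] * 4,
                       'problem': ['P_1'] * 3 + ['P_2'] * 3 + ['P_1'] * 3 + ['P_2'] * 3,
                       'value':   [1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0]})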
Upvotes: 0