Reputation: 109
This is a weird one: I have 3 dataframes, "prov_data" which contains a provider id and counts on regions and categories (i.e. how many times that provider interacted with those regions and categories).
prov_data = DataFrame({'aprov_id':[1122,3344,5566,7788],'prov_region_1':[0,0,4,0],'prov_region_2':[2,0,0,0],
'prov_region_3':[0,1,0,1],'prov_cat_1':[0,2,0,0],'prov_cat_2':[1,0,3,0],'prov_cat_3':[0,0,0,4],
'prov_cat_4':[0,3,0,0]})
"tender_data" which contains the same but for tenders.
tender_data = DataFrame({'atender_id':['AA12','BB33','CC45'],
'ten_region_1':[0,0,1,],'ten_region_2':[0,1,0],
'ten_region_3':[1,1,0],'ten_cat_1':[1,0,0],
'ten_cat_2':[0,1,0],'ten_cat_3':[0,1,0],
'ten_cat_4':[0,0,1]})
And finally a "no_match" DF which contains forbidden matches between provider and tender.
no_match = DataFrame({ 'prov_id':[1122,3344,5566],
'tender_id':['AA12','BB33','CC45']})
I need to do the following: create a new df that will append the rows of the prov_data & tender_data DataFrames if they (1) match one or more categories (i.e. the same category is > 0) AND (2) match one or more regions AND (3) are not on the no_match list.
So that would give me this DF:
df = DataFrame({'aprov_id':[1122,3344,7788],'prov_region_1':[0,0,0],'prov_region_2':[2,0,0],
'prov_region_3':[0,1,1],'prov_cat_1':[0,2,0],'prov_cat_2':[1,0,0],'prov_cat_3':[0,0,4],
'prov_cat_4':[0,3,0], 'atender_id':['BB33','AA12','BB33'],
'ten_region_1':[0,0,0],'ten_region_2':[1,0,1],
'ten_region_3':[1,1,1],'ten_cat_1':[0,1,0],
'ten_cat_2':[1,0,1],'ten_cat_3':[1,0,1],
'ten_cat_4':[0,0,0]})
Upvotes: 3
Views: 79
Reputation: 3212
Straightforward solution that uses only "standard" pandas
techniques.
prov_data['tkey'] = 1
tender_data['tkey'] = 1
df1 = pd.merge(prov_data,tender_data,how='outer',on='tkey')
df1 = pd.merge(df1,no_match,how='outer',left_on = 'aprov_id', right_on = 'prov_id')
df1['dropData'] = df1.apply(lambda x: True if x['tender_id'] == x['atender_id'] else False, axis=1)
df1['dropData'] = df1.apply(lambda x: (x['dropData'] == True) or not(
((x['prov_cat_1'] > 0 and x['ten_cat_1'] > 0) or
(x['prov_cat_2'] > 0 and x['ten_cat_2'] > 0) or
(x['prov_cat_3'] > 0 and x['ten_cat_3'] > 0) or
(x['prov_cat_4'] > 0 and x['ten_cat_4'] > 0)) and(
(x['prov_region_1'] > 0 and x['ten_region_1'] > 0) or
(x['prov_region_2'] > 0 and x['ten_region_2'] > 0) or
(x['prov_region_3'] > 0 and x['ten_region_3'] > 0))),axis=1)
df1 = df1[~df1.dropData]
df1 = df1[[u'aprov_id', u'atender_id', u'prov_cat_1', u'prov_cat_2', u'prov_cat_3',
u'prov_cat_4', u'prov_region_1', u'prov_region_2', u'prov_region_3',
u'ten_cat_1', u'ten_cat_2', u'ten_cat_3', u'ten_cat_4', u'ten_region_1',
u'ten_region_2', u'ten_region_3']].reset_index(drop=True)
print df1.equals(df)
First we do a full cross product of both dataframes and merge that with the no_match
dataframe, then add a boolean column to mark all rows to be dropped.
The boolean column is assigned by two boolean lambda functions with all the necessary conditions, then we just take all rows where that column is False
.
This solution isn't very ressource-friendly due to the merge operations, so if your data is very large it may be disadvantageous.
Upvotes: 0
Reputation: 294508
code
# the first columns of each dataframe are the ids
# i'm going to use them several times
tid = tender_data.values[:, 0]
pid = prov_data.values[:, 0]
# first columns [1, 2, 3, 4] are cat columns
# we could have used filter, but this is good
# for this example
pc = prov_data.values[:, 1:5]
tc = tender_data.values[:, 1:5]
# columns [5, 6, 7] are rgn columns
pr = prov_data.values[:, 5:]
tr = tender_data.values[:, 5:]
# I want to mave this an m x n array, where
# m = number of rows in prov df and n = rows in tender
nm = no_match.groupby(['prov_id', 'tender_id']).size().unstack()
nm = nm.reindex_axis(tid, 1).reindex_axis(pid, 0)
nm = ~nm.fillna(0).astype(bool).values * 1
# the dot products of the cat arrays gets a handy
# array where there are > 1 co-positive values
# this combined with the a no_match construct
a = pd.DataFrame(pc.dot(tc.T) * pr.dot(tr.T) * nm > 0, pid, tid)
a = a.mask(~a).stack().index
fp = a.get_level_values(0)
ft = a.get_level_values(1)
pd.concat([
prov_data.set_index('aprov_id').loc[fp].reset_index(),
tender_data.set_index('atender_id').loc[ft].reset_index()
], axis=1)
index prov_cat_1 prov_cat_2 prov_cat_3 prov_cat_4 prov_region_1 \
0 1122 0 1 0 0 0
1 3344 2 0 0 3 0
2 7788 0 0 4 0 0
prov_region_2 prov_region_3 atender_id ten_cat_1 ten_cat_2 ten_cat_3 \
0 2 0 BB33 0 1 1
1 0 1 AA12 1 0 0
2 0 1 BB33 0 1 1
ten_cat_4 ten_region_1 ten_region_2 ten_region_3
0 0 0 1 1
1 0 0 0 1
2 0 0 1 1
explanation
Upvotes: 2