Reputation: 789
I am having trouble searching one list of numbers from one dataframe in the column of another. I am dealing with a case where different numbers refer to the same item. I am having trouble searching one list of numbers from one dataframe in the column of another. I am dealing with a case where different numbers refer to the same item.
In my sample code, "term2" has entries with both code 'AAAA' and 'CCCC'.
df1 = pd.DataFrame(
{
'term' : ['term1','term2','term3','term4','term1','term3','term2'],
'code': ['ZZZZ', 'AAAA','XXXX','QQQQ','ZZZZ','XXXX','CCCC'],
'count':['1','3','1','1','1','1','3']
}
)
df1 = df1[['term'] + df1.columns[:-1].tolist()]
df2 = pd.DataFrame(
{
'name': ['Dan','Sara','Conroy','Steve'],
'rate': ['3','3.5','5.2','3'],
'location': ['FL','OH','NM','NM'],
'code': ['XXXX','BBBB','PPPP','TTTT'],
'allcodes': ['XXXX,YYYY,ZZZZ','AAAA,BBBB,CCCC','PPPP,QQQQ,RRRR','SSSS,TTTT,UUUU']
}
)
df2 = df2[['name','rate','location','code','allcodes']]
pmissnlist = df1['code'].values.tolist()
df3 = df2[df2['allcodes'].str.contains('|'.join(pmissnlist))].reset_index()
df4 = pd.DataFrame(df3['allcodes'].str.split(',').tolist(),
index=df3.code).stack(level=0).reset_index(level=0)
matches = df4[df4[0].isin(df1.code)]
df5 = df3[df3.code.isin(matches.code)]
df5 = df5.drop_duplicates()#drops duplicate rows
df5_temp = df5.set_index(['name','rate','location','code']).allcodes.str.split(',',expand = True)\
.stack().reset_index(4,drop = True).reset_index(name = 'allcodes')
df5 = pd.merge(df1, df5_temp, left_on = 'code', right_on = 'allcodes')
df5 = df5.drop('allcodes', 1).groupby(['code_y','term','count','name','rate','location']).code_x.\
apply(','.join).reset_index().sort_values(by='count', ascending=False)
print(df5)
The output is:
code_y term count name rate location code_x
0 BBBB term2 3 Sara 3.5 OH AAAA,CCCC
1 PPPP term4 1 Conroy 5.2 NM QQQQ
2 XXXX term1 1 Dan 3 FL ZZZZ,ZZZZ
3 XXXX term3 1 Dan 3 FL XXXX,XXXX
There are two instances of term2 associated with Sara, so I want her count to be 6, not 3. The counts should be 2 for each of Dan's rows, too. If I understand the dataframe correctly, this is expected: the dataframe contains the data that satisfies the condition of the first term2. When the second instance is compared, the related data already exists in the dataframe.
How do I sum the count values where multiple codes are matched in the "allcode" field of df2?
EDIT: Adding desired outcome
code_y term count name rate location code_x
0 BBBB term2 6 Sara 3.5 OH AAAA,CCCC
1 PPPP term4 1 Conroy 5.2 NM QQQQ
2 XXXX term1 2 Dan 3 FL ZZZZ,ZZZZ
3 XXXX term3 2 Dan 3 FL XXXX,XXXX
EDIT2: The problem is with the groupby.
Here is the output of df5 BEFORE the statement with the groupby:
term code_x count name rate location code_y allcodes
0 term1 ZZZZ 1 Dan 3 FL XXXX ZZZZ
1 term1 ZZZZ 1 Dan 3 FL XXXX ZZZZ
2 term2 AAAA 3 Sara 3.5 OH BBBB AAAA
3 term3 XXXX 1 Dan 3 FL XXXX XXXX
4 term3 XXXX 1 Dan 3 FL XXXX XXXX
5 term4 QQQQ 1 Conroy 5.2 NM PPPP QQQQ
6 term2 CCCC 2 Sara 3.5 OH BBBB CCCC
7 term6 TTTT 1 Steve 3 NM TTTT TTTT
So, essentially I want counts for the column with the same term (term2 for Sara, and term3 & term4 for Dan) before the groupby.
Upvotes: 0
Views: 883
Reputation: 323276
IIUC , By using str.split
and len
df['count']*=df['code_x'].str.split(',').apply(len)
df
Out[1105]:
code_y term count name rate location code_x
0 BBBB term2 6 Sara 3.5 OH AAAA,CCCC
1 PPPP term4 1 Conroy 5.2 NM QQQQ
2 XXXX term1 2 Dan 3.0 FL ZZZZ,ZZZZ
3 XXXX term3 2 Dan 3.0 FL XXXX,XXXX
EDIT :
df.groupby(['code_y','term'],as_index=False).agg({'count':'sum','name':'first','rate':'first','location':'first','code_x':lambda x : ','.join(x),'allcodes':'first'})
Out[1130]:
code_y term count name rate location code_x allcodes
0 BBBB term2 5 Sara 3.5 OH AAAA,CCCC AAAA
1 PPPP term4 1 Conroy 5.2 NM QQQQ QQQQ
2 TTTT term6 1 Steve 3.0 NM TTTT TTTT
3 XXXX term1 2 Dan 3.0 FL ZZZZ,ZZZZ ZZZZ
4 XXXX term3 2 Dan 3.0 FL XXXX,XXXX XXXX
Upvotes: 2
Reputation: 9711
I created a small function to get the expected output.
def f(x):
return pd.Series(dict(count = x['count'].sum(),
code_x = "%s" % ', '.join(x['code_x'])))
df5=df5.groupby(['code_y','term','name','rate','location'])[['code_x','count']].apply(f).reset_index()
output
code_y term name rate location code_x count
0 BBBB term2 Sara 3.5 OH AAAA, CCCC 6
1 PPPP term4 Conroy 5.2 NM QQQQ 1
2 XXXX term1 Dan 3 FL ZZZZ, ZZZZ 2
3 XXXX term3 Dan 3 FL XXXX, XXXX 2
Inspired by https://stackoverflow.com/a/17841294/6626530
Upvotes: 1