mattrweaver
mattrweaver

Reputation: 789

Pandas: sum values in column if multiple values match str/contains/join

I am having trouble searching one list of numbers from one dataframe in the column of another. I am dealing with a case where different numbers refer to the same item. I am having trouble searching one list of numbers from one dataframe in the column of another. I am dealing with a case where different numbers refer to the same item.

In my sample code, "term2" has entries with both code 'AAAA' and 'CCCC'.

    df1 = pd.DataFrame(
        {
        'term' : ['term1','term2','term3','term4','term1','term3','term2'],
        'code': ['ZZZZ', 'AAAA','XXXX','QQQQ','ZZZZ','XXXX','CCCC'],
        'count':['1','3','1','1','1','1','3']
        }
        )
df1 = df1[['term'] + df1.columns[:-1].tolist()]
df2 = pd.DataFrame(
        {
        'name': ['Dan','Sara','Conroy','Steve'],
        'rate': ['3','3.5','5.2','3'],
        'location': ['FL','OH','NM','NM'],
        'code': ['XXXX','BBBB','PPPP','TTTT'],
        'allcodes': ['XXXX,YYYY,ZZZZ','AAAA,BBBB,CCCC','PPPP,QQQQ,RRRR','SSSS,TTTT,UUUU']
         }
         )

df2 = df2[['name','rate','location','code','allcodes']]
pmissnlist = df1['code'].values.tolist()
df3 = df2[df2['allcodes'].str.contains('|'.join(pmissnlist))].reset_index()

df4 = pd.DataFrame(df3['allcodes'].str.split(',').tolist(),
                            index=df3.code).stack(level=0).reset_index(level=0)
matches = df4[df4[0].isin(df1.code)]
df5 = df3[df3.code.isin(matches.code)]
df5 = df5.drop_duplicates()#drops duplicate rows

df5_temp = df5.set_index(['name','rate','location','code']).allcodes.str.split(',',expand = True)\
                                  .stack().reset_index(4,drop = True).reset_index(name = 'allcodes')
df5 = pd.merge(df1, df5_temp, left_on = 'code', right_on = 'allcodes')
df5 = df5.drop('allcodes', 1).groupby(['code_y','term','count','name','rate','location']).code_x.\
                                    apply(','.join).reset_index().sort_values(by='count', ascending=False)
print(df5)

The output is:

      code_y   term   count  name   rate  location  code_x
0     BBBB     term2  3      Sara   3.5   OH        AAAA,CCCC
1     PPPP     term4  1      Conroy 5.2   NM        QQQQ
2     XXXX     term1  1      Dan    3     FL        ZZZZ,ZZZZ
3     XXXX     term3  1      Dan    3     FL        XXXX,XXXX

There are two instances of term2 associated with Sara, so I want her count to be 6, not 3. The counts should be 2 for each of Dan's rows, too. If I understand the dataframe correctly, this is expected: the dataframe contains the data that satisfies the condition of the first term2. When the second instance is compared, the related data already exists in the dataframe.

How do I sum the count values where multiple codes are matched in the "allcode" field of df2?

EDIT: Adding desired outcome

      code_y   term   count  name   rate  location  code_x
0     BBBB     term2  6      Sara   3.5   OH        AAAA,CCCC
1     PPPP     term4  1      Conroy 5.2   NM        QQQQ
2     XXXX     term1  2      Dan    3     FL        ZZZZ,ZZZZ
3     XXXX     term3  2      Dan    3     FL        XXXX,XXXX

EDIT2: The problem is with the groupby.

Here is the output of df5 BEFORE the statement with the groupby:

    term code_x  count    name rate location code_y allcodes
0  term1   ZZZZ      1     Dan    3       FL   XXXX     ZZZZ
1  term1   ZZZZ      1     Dan    3       FL   XXXX     ZZZZ
2  term2   AAAA      3    Sara  3.5       OH   BBBB     AAAA
3  term3   XXXX      1     Dan    3       FL   XXXX     XXXX
4  term3   XXXX      1     Dan    3       FL   XXXX     XXXX
5  term4   QQQQ      1  Conroy  5.2       NM   PPPP     QQQQ
6  term2   CCCC      2    Sara  3.5       OH   BBBB     CCCC
7  term6   TTTT      1   Steve    3       NM   TTTT     TTTT

So, essentially I want counts for the column with the same term (term2 for Sara, and term3 & term4 for Dan) before the groupby.

Upvotes: 0

Views: 883

Answers (2)

BENY
BENY

Reputation: 323276

IIUC , By using str.split and len

df['count']*=df['code_x'].str.split(',').apply(len)
df
Out[1105]: 
  code_y   term  count    name  rate location     code_x
0   BBBB  term2      6    Sara   3.5       OH  AAAA,CCCC
1   PPPP  term4      1  Conroy   5.2       NM       QQQQ
2   XXXX  term1      2     Dan   3.0       FL  ZZZZ,ZZZZ
3   XXXX  term3      2     Dan   3.0       FL  XXXX,XXXX

EDIT :

    df.groupby(['code_y','term'],as_index=False).agg({'count':'sum','name':'first','rate':'first','location':'first','code_x':lambda x : ','.join(x),'allcodes':'first'})
Out[1130]: 
  code_y   term  count    name  rate location     code_x allcodes
0   BBBB  term2      5    Sara   3.5       OH  AAAA,CCCC     AAAA
1   PPPP  term4      1  Conroy   5.2       NM       QQQQ     QQQQ
2   TTTT  term6      1   Steve   3.0       NM       TTTT     TTTT
3   XXXX  term1      2     Dan   3.0       FL  ZZZZ,ZZZZ     ZZZZ
4   XXXX  term3      2     Dan   3.0       FL  XXXX,XXXX     XXXX

Upvotes: 2

Shijo
Shijo

Reputation: 9711

I created a small function to get the expected output.

def f(x):
     return pd.Series(dict(count = x['count'].sum(), 
                        code_x = "%s" % ', '.join(x['code_x'])))

df5=df5.groupby(['code_y','term','name','rate','location'])[['code_x','count']].apply(f).reset_index()

output

  code_y   term    name rate location      code_x  count
0   BBBB  term2    Sara  3.5       OH  AAAA, CCCC      6
1   PPPP  term4  Conroy  5.2       NM        QQQQ      1
2   XXXX  term1     Dan    3       FL  ZZZZ, ZZZZ      2
3   XXXX  term3     Dan    3       FL  XXXX, XXXX      2

Inspired by https://stackoverflow.com/a/17841294/6626530

Upvotes: 1

Related Questions