Pandas: sum values in column if multiple values match str/contains/join

Question

I am having trouble searching one list of numbers from one dataframe in the column of another. I am dealing with a case where different numbers refer to the same item. I am having trouble searching one list of numbers from one dataframe in the column of another. I am dealing with a case where different numbers refer to the same item.

In my sample code, "term2" has entries with both code 'AAAA' and 'CCCC'.

    df1 = pd.DataFrame(
        {
        'term' : ['term1','term2','term3','term4','term1','term3','term2'],
        'code': ['ZZZZ', 'AAAA','XXXX','QQQQ','ZZZZ','XXXX','CCCC'],
        'count':['1','3','1','1','1','1','3']
        }
        )
df1 = df1[['term'] + df1.columns[:-1].tolist()]
df2 = pd.DataFrame(
        {
        'name': ['Dan','Sara','Conroy','Steve'],
        'rate': ['3','3.5','5.2','3'],
        'location': ['FL','OH','NM','NM'],
        'code': ['XXXX','BBBB','PPPP','TTTT'],
        'allcodes': ['XXXX,YYYY,ZZZZ','AAAA,BBBB,CCCC','PPPP,QQQQ,RRRR','SSSS,TTTT,UUUU']
         }
         )

df2 = df2[['name','rate','location','code','allcodes']]
pmissnlist = df1['code'].values.tolist()
df3 = df2[df2['allcodes'].str.contains('|'.join(pmissnlist))].reset_index()

df4 = pd.DataFrame(df3['allcodes'].str.split(',').tolist(),
                            index=df3.code).stack(level=0).reset_index(level=0)
matches = df4[df4[0].isin(df1.code)]
df5 = df3[df3.code.isin(matches.code)]
df5 = df5.drop_duplicates()#drops duplicate rows

df5_temp = df5.set_index(['name','rate','location','code']).allcodes.str.split(',',expand = True)\
                                  .stack().reset_index(4,drop = True).reset_index(name = 'allcodes')
df5 = pd.merge(df1, df5_temp, left_on = 'code', right_on = 'allcodes')
df5 = df5.drop('allcodes', 1).groupby(['code_y','term','count','name','rate','location']).code_x.\
                                    apply(','.join).reset_index().sort_values(by='count', ascending=False)
print(df5)

The output is:

      code_y   term   count  name   rate  location  code_x
0     BBBB     term2  3      Sara   3.5   OH        AAAA,CCCC
1     PPPP     term4  1      Conroy 5.2   NM        QQQQ
2     XXXX     term1  1      Dan    3     FL        ZZZZ,ZZZZ
3     XXXX     term3  1      Dan    3     FL        XXXX,XXXX

There are two instances of term2 associated with Sara, so I want her count to be 6, not 3. The counts should be 2 for each of Dan's rows, too. If I understand the dataframe correctly, this is expected: the dataframe contains the data that satisfies the condition of the first term2. When the second instance is compared, the related data already exists in the dataframe.

How do I sum the count values where multiple codes are matched in the "allcode" field of df2?

EDIT: Adding desired outcome

      code_y   term   count  name   rate  location  code_x
0     BBBB     term2  6      Sara   3.5   OH        AAAA,CCCC
1     PPPP     term4  1      Conroy 5.2   NM        QQQQ
2     XXXX     term1  2      Dan    3     FL        ZZZZ,ZZZZ
3     XXXX     term3  2      Dan    3     FL        XXXX,XXXX

EDIT2: The problem is with the groupby.

Here is the output of df5 BEFORE the statement with the groupby:

    term code_x  count    name rate location code_y allcodes
0  term1   ZZZZ      1     Dan    3       FL   XXXX     ZZZZ
1  term1   ZZZZ      1     Dan    3       FL   XXXX     ZZZZ
2  term2   AAAA      3    Sara  3.5       OH   BBBB     AAAA
3  term3   XXXX      1     Dan    3       FL   XXXX     XXXX
4  term3   XXXX      1     Dan    3       FL   XXXX     XXXX
5  term4   QQQQ      1  Conroy  5.2       NM   PPPP     QQQQ
6  term2   CCCC      2    Sara  3.5       OH   BBBB     CCCC
7  term6   TTTT      1   Steve    3       NM   TTTT     TTTT

So, essentially I want counts for the column with the same term (term2 for Sara, and term3 & term4 for Dan) before the groupby.

Shijo · Accepted Answer

I created a small function to get the expected output.

def f(x):
     return pd.Series(dict(count = x['count'].sum(), 
                        code_x = "%s" % ', '.join(x['code_x'])))

df5=df5.groupby(['code_y','term','name','rate','location'])[['code_x','count']].apply(f).reset_index()

output

  code_y   term    name rate location      code_x  count
0   BBBB  term2    Sara  3.5       OH  AAAA, CCCC      6
1   PPPP  term4  Conroy  5.2       NM        QQQQ      1
2   XXXX  term1     Dan    3       FL  ZZZZ, ZZZZ      2
3   XXXX  term3     Dan    3       FL  XXXX, XXXX      2

Inspired by https://stackoverflow.com/a/17841294/6626530

Pandas: sum values in column if multiple values match str/contains/join

Answers (2)

Related Questions