Sum amount associated with item in a column of lists

Question

I have a dataframe as follows:

country   letter             keywords                 amount
   c         y     ['fruits', 'apples', "banana"]       700
   c         y     ["music", "dance", "banana"]         150
   c         y     ['loud', "dance", "apples"]          350

I'd like to sum the amount associated with each keyword. Note: country and letters are not always the same, as in the contrived data above. Also, the list of keywords vary in size.

I've tried several solutions. I've attached my fastest one below. I've also tried solutions with apply and defaultdicts...

keywords_list = []
for i in zip(*[df[c] for c in df.columns]):
    data = list(i[0:2])
    for k in i[2]:
        row = [k] + data + [i[-1]]
        keywords_list.append(row)

df_expanded = pd.DataFrame(keywords_list)
df_expanded.groupby(list(range(3)))[3].sum().reset_index()

Goal

  country letter keywords  amount
0       c      y   apples    1050
1       c      y   banana     850
2       c      y    dance     500
3       c      y   fruits     700
4       c      y     loud     350
5       c      y    music     150

Edit: corrected error in example of goal

Data

country = list("ccc")
letters = list("yyy")
keywords = [['fruits', 'apples', "banana"], ["music", "dance", "banana"], ['loud', "dance", "apples"]]
amount = [700, 150, 350]

df = pd.DataFrame({"country" : country, "keywords": keywords, "letter" : letters, "amount" : amount})
df = df[['country', 'letter', 'keywords', 'amount']]

jezrael · Accepted Answer

You can use:

df1 = pd.DataFrame(df.keywords.values.tolist())
       .stack()
       .reset_index(level=1, drop=True)
       .rename('keywords')
print (df1)
0    fruits
0    apples
0    banana
1     music
1     dance
1    banana
2      loud
2     dance
2    apples
Name: keywords, dtype: object

print (df.drop('keywords', axis=1).join(df1).reset_index(drop=True))
  country letter  amount keywords
0       c      y     700   fruits
1       c      y     700   apples
2       c      y     700   banana
3       c      y     150    music
4       c      y     150    dance
5       c      y     150   banana
6       c      y     350     loud
7       c      y     350    dance
8       c      y     350   apples

Another solution:

df = df.set_index(['country','letter','amount'])
df1 = pd.DataFrame(df.keywords.values.tolist(), index = df.index) \
        .stack() \
        .reset_index(name='keywords') \
        .drop('level_3',axis=1)
print (df1)
  country letter  amount keywords
0       c      y     700   fruits
1       c      y     700   apples
2       c      y     700   banana
3       c      y     150    music
4       c      y     150    dance
5       c      y     150   banana
6       c      y     350     loud
7       c      y     350    dance
8       c      y     350   apples

Then need groupby with aggregating sum:

print (df.groupby(['country','letter','keywords'], as_index=False)['amount'].sum())
  country letter keywords  amount
0       c      y   apples    1050
1       c      y   banana     850
2       c      y    dance     500
3       c      y   fruits     700
4       c      y     loud     350
5       c      y    music     150

Timings:

In [47]: %timeit (df.set_index(['country','letter','amount']).keywords.apply(pd.Series).stack().reset_index().drop('level_3',1))
1 loop, best of 3: 4.55 s per loop

In [48]: %timeit (jez1(df3))
10 loops, best of 3: 24.8 ms per loop

In [49]: %timeit (jez2(df3))
10 loops, best of 3: 29.7 ms per loop

Code for timings:

df = pd.concat([df]*10000).reset_index(drop=True)
df3 = df.copy()
df4 = df.copy()                

def jez1(df):
    df1 = pd.DataFrame(df.keywords.values.tolist()).stack().reset_index(level=1, drop=True).rename('keywords')
    return df.drop('keywords', axis=1).join(df1).reset_index(drop=True)

def jez2(df):
    df = df.set_index(['country','letter','amount'])
    df1 = pd.DataFrame(df.keywords.values.tolist(), index = df.index).stack().reset_index(name='keywords').drop('level_3',axis=1)
    return df1

Thank you MaxU for improvement with pop - then another drop is not necessary. Unfortunately timing failed (KeyError: 'keywords'), so I cannot compare it.

Sum amount associated with item in a column of lists

Answers (2)

Related Questions