lnNoam
lnNoam

Reputation: 1055

Sum amount associated with item in a column of lists

I have a dataframe as follows:

country   letter             keywords                 amount
   c         y     ['fruits', 'apples', "banana"]       700
   c         y     ["music", "dance", "banana"]         150
   c         y     ['loud', "dance", "apples"]          350

I'd like to sum the amount associated with each keyword. Note: country and letters are not always the same, as in the contrived data above. Also, the list of keywords vary in size.

I've tried several solutions. I've attached my fastest one below. I've also tried solutions with apply and defaultdicts...

keywords_list = []
for i in zip(*[df[c] for c in df.columns]):
    data = list(i[0:2])
    for k in i[2]:
        row = [k] + data + [i[-1]]
        keywords_list.append(row)

df_expanded = pd.DataFrame(keywords_list)
df_expanded.groupby(list(range(3)))[3].sum().reset_index()

Goal

  country letter keywords  amount
0       c      y   apples    1050
1       c      y   banana     850
2       c      y    dance     500
3       c      y   fruits     700
4       c      y     loud     350
5       c      y    music     150

Edit: corrected error in example of goal


Data

country = list("ccc")
letters = list("yyy")
keywords = [['fruits', 'apples', "banana"], ["music", "dance", "banana"], ['loud', "dance", "apples"]]
amount = [700, 150, 350]

df = pd.DataFrame({"country" : country, "keywords": keywords, "letter" : letters, "amount" : amount})
df = df[['country', 'letter', 'keywords', 'amount']]

Upvotes: 1

Views: 54

Answers (2)

jezrael
jezrael

Reputation: 862511

You can use:

df1 = pd.DataFrame(df.keywords.values.tolist())
       .stack()
       .reset_index(level=1, drop=True)
       .rename('keywords')
print (df1)
0    fruits
0    apples
0    banana
1     music
1     dance
1    banana
2      loud
2     dance
2    apples
Name: keywords, dtype: object

print (df.drop('keywords', axis=1).join(df1).reset_index(drop=True))
  country letter  amount keywords
0       c      y     700   fruits
1       c      y     700   apples
2       c      y     700   banana
3       c      y     150    music
4       c      y     150    dance
5       c      y     150   banana
6       c      y     350     loud
7       c      y     350    dance
8       c      y     350   apples

Another solution:

df = df.set_index(['country','letter','amount'])
df1 = pd.DataFrame(df.keywords.values.tolist(), index = df.index) \
        .stack() \
        .reset_index(name='keywords') \
        .drop('level_3',axis=1)
print (df1)
  country letter  amount keywords
0       c      y     700   fruits
1       c      y     700   apples
2       c      y     700   banana
3       c      y     150    music
4       c      y     150    dance
5       c      y     150   banana
6       c      y     350     loud
7       c      y     350    dance
8       c      y     350   apples

Then need groupby with aggregating sum:

print (df.groupby(['country','letter','keywords'], as_index=False)['amount'].sum())
  country letter keywords  amount
0       c      y   apples    1050
1       c      y   banana     850
2       c      y    dance     500
3       c      y   fruits     700
4       c      y     loud     350
5       c      y    music     150

Timings:

In [47]: %timeit (df.set_index(['country','letter','amount']).keywords.apply(pd.Series).stack().reset_index().drop('level_3',1))
1 loop, best of 3: 4.55 s per loop

In [48]: %timeit (jez1(df3))
10 loops, best of 3: 24.8 ms per loop

In [49]: %timeit (jez2(df3))
10 loops, best of 3: 29.7 ms per loop

Code for timings:

df = pd.concat([df]*10000).reset_index(drop=True)
df3 = df.copy()
df4 = df.copy()                

def jez1(df):
    df1 = pd.DataFrame(df.keywords.values.tolist()).stack().reset_index(level=1, drop=True).rename('keywords')
    return df.drop('keywords', axis=1).join(df1).reset_index(drop=True)

def jez2(df):
    df = df.set_index(['country','letter','amount'])
    df1 = pd.DataFrame(df.keywords.values.tolist(), index = df.index).stack().reset_index(name='keywords').drop('level_3',axis=1)
    return df1

Thank you MaxU for improvement with pop - then another drop is not necessary. Unfortunately timing failed (KeyError: 'keywords'), so I cannot compare it.

Upvotes: 2

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210832

try this:

In [76]: (df.set_index(['country','letter','amount'])
    ...:    .keywords
    ...:    .apply(pd.Series)
    ...:    .stack()
    ...:    .reset_index(name='keywords')
    ...:    .drop('level_3',1)
    ...: )
    ...:
Out[76]:
  country letter  amount keywords
0       c      y     700   fruits
1       c      y     700   apples
2       c      y     700   banana
3       c      y     150    music
4       c      y     150    dance
5       c      y     150   banana
6       c      y     350     loud
7       c      y     350    dance
8       c      y     350   apples

Upvotes: 1

Related Questions