Reputation: 1055
I have a dataframe as follows:
country letter keywords amount
c y ['fruits', 'apples', "banana"] 700
c y ["music", "dance", "banana"] 150
c y ['loud', "dance", "apples"] 350
I'd like to sum the amount associated with each keyword.
Note: country
and letters
are not always the same, as in the contrived data above. Also, the list of keywords
vary in size.
I've tried several solutions. I've attached my fastest one below. I've also tried solutions with apply
and defaultdicts
...
keywords_list = []
for i in zip(*[df[c] for c in df.columns]):
data = list(i[0:2])
for k in i[2]:
row = [k] + data + [i[-1]]
keywords_list.append(row)
df_expanded = pd.DataFrame(keywords_list)
df_expanded.groupby(list(range(3)))[3].sum().reset_index()
Goal
country letter keywords amount
0 c y apples 1050
1 c y banana 850
2 c y dance 500
3 c y fruits 700
4 c y loud 350
5 c y music 150
Edit: corrected error in example of goal
Data
country = list("ccc")
letters = list("yyy")
keywords = [['fruits', 'apples', "banana"], ["music", "dance", "banana"], ['loud', "dance", "apples"]]
amount = [700, 150, 350]
df = pd.DataFrame({"country" : country, "keywords": keywords, "letter" : letters, "amount" : amount})
df = df[['country', 'letter', 'keywords', 'amount']]
Upvotes: 1
Views: 54
Reputation: 862511
You can use:
df1 = pd.DataFrame(df.keywords.values.tolist())
.stack()
.reset_index(level=1, drop=True)
.rename('keywords')
print (df1)
0 fruits
0 apples
0 banana
1 music
1 dance
1 banana
2 loud
2 dance
2 apples
Name: keywords, dtype: object
print (df.drop('keywords', axis=1).join(df1).reset_index(drop=True))
country letter amount keywords
0 c y 700 fruits
1 c y 700 apples
2 c y 700 banana
3 c y 150 music
4 c y 150 dance
5 c y 150 banana
6 c y 350 loud
7 c y 350 dance
8 c y 350 apples
Another solution:
df = df.set_index(['country','letter','amount'])
df1 = pd.DataFrame(df.keywords.values.tolist(), index = df.index) \
.stack() \
.reset_index(name='keywords') \
.drop('level_3',axis=1)
print (df1)
country letter amount keywords
0 c y 700 fruits
1 c y 700 apples
2 c y 700 banana
3 c y 150 music
4 c y 150 dance
5 c y 150 banana
6 c y 350 loud
7 c y 350 dance
8 c y 350 apples
Then need groupby
with aggregating sum
:
print (df.groupby(['country','letter','keywords'], as_index=False)['amount'].sum())
country letter keywords amount
0 c y apples 1050
1 c y banana 850
2 c y dance 500
3 c y fruits 700
4 c y loud 350
5 c y music 150
Timings:
In [47]: %timeit (df.set_index(['country','letter','amount']).keywords.apply(pd.Series).stack().reset_index().drop('level_3',1))
1 loop, best of 3: 4.55 s per loop
In [48]: %timeit (jez1(df3))
10 loops, best of 3: 24.8 ms per loop
In [49]: %timeit (jez2(df3))
10 loops, best of 3: 29.7 ms per loop
Code for timings:
df = pd.concat([df]*10000).reset_index(drop=True)
df3 = df.copy()
df4 = df.copy()
def jez1(df):
df1 = pd.DataFrame(df.keywords.values.tolist()).stack().reset_index(level=1, drop=True).rename('keywords')
return df.drop('keywords', axis=1).join(df1).reset_index(drop=True)
def jez2(df):
df = df.set_index(['country','letter','amount'])
df1 = pd.DataFrame(df.keywords.values.tolist(), index = df.index).stack().reset_index(name='keywords').drop('level_3',axis=1)
return df1
Thank you MaxU
for improvement with pop
- then another drop
is not necessary. Unfortunately timing
failed (KeyError: 'keywords'
), so I cannot compare it.
Upvotes: 2
Reputation: 210832
try this:
In [76]: (df.set_index(['country','letter','amount'])
...: .keywords
...: .apply(pd.Series)
...: .stack()
...: .reset_index(name='keywords')
...: .drop('level_3',1)
...: )
...:
Out[76]:
country letter amount keywords
0 c y 700 fruits
1 c y 700 apples
2 c y 700 banana
3 c y 150 music
4 c y 150 dance
5 c y 150 banana
6 c y 350 loud
7 c y 350 dance
8 c y 350 apples
Upvotes: 1