Luca F.

Reputation: 87

Pandas - Sum values in list according to index from another list

I am trying to find the most Pythonic way to solve my problem, and it needs to run as quickly as possible because I am dealing with a large amount of data. My problem is the following:

I have two lists

a = [12,34,674,2,0,5,6,8]
b = ['foo','bar','bar','foo','foo','bar','foo','foo']

I want to tell Python: wherever 'bar' appears in b, collect those indexes and sum the values in list a at those indexes.

This is what I have done so far:

idx = [i for i, j in enumerate(b) if j == 'bar']

but then I am stuck. I am considering using some weird for loops. Do you have any ideas?

Upvotes: 2

Views: 1548

Answers (4)

Paul Panzer

Reputation: 53029

Use np.bincount, which computes both sums ('foo' and 'bar') in a single call:

import numpy as np

sum_foo, sum_bar = np.bincount(np.char.equal(b, 'bar'), a)
sum_foo
# 28.0
sum_bar
# 713.0

Note that np.char.equal works on both lists and arrays. If b is already a NumPy array, then b == 'bar' can be used instead and is a bit faster.
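To see why this works, here is a minimal sketch of how np.bincount can treat the mask as bin indices (0 for 'foo', 1 for 'bar') and accumulate the values of a into each bin. The explicit astype(int) cast, the minlength=2 argument, and the names mask and bins are illustrative additions, not part of the answer above.

```python
import numpy as np

a = [12, 34, 674, 2, 0, 5, 6, 8]
b = ['foo', 'bar', 'bar', 'foo', 'foo', 'bar', 'foo', 'foo']

# Boolean mask: True where the label is 'bar'.
mask = np.array(b) == 'bar'

# Cast to integers so each element names a bin: 0 for 'foo', 1 for 'bar'.
bins = mask.astype(int)

# With weights, bincount adds a[i] to the bin selected by bins[i].
# minlength=2 keeps the unpacking below valid even if no 'bar' is present.
sum_foo, sum_bar = np.bincount(bins, weights=a, minlength=2)

print(sum_foo, sum_bar)  # 28.0 713.0
```

The results are floats because bincount returns a float array whenever weights are supplied.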

Timings:

Even though this computes both sums, it is actually pretty fast (here a and b have been converted to NumPy arrays):

timeit(lambda: np.bincount(b == 'bar', a))
# 2.406161994993454

Compare for example with the numpy masking method:

timeit(lambda: a[b == 'bar'].sum())
# 5.642918559984537

On larger arrays, masking becomes slightly faster, which is expected since bincount does essentially twice the work. Still, bincount takes less than twice the time, so if you need both sums ('foo' and 'bar'), bincount is still the faster option.

aa = np.repeat(a, 1000)
bb = np.repeat(b, 1000)
timeit(lambda: aa[bb == 'bar'].sum(), number=1000)
# 0.07860603698645718
timeit(lambda: np.bincount(bb == 'bar', aa), number=1000)
# 0.11229897901648656

Upvotes: 3

EdChum

Reputation: 393963

This is simple to do in pandas:

In [5]:
import pandas as pd
a = [12,34,674,2,0,5,6,8]
b = ['foo','bar','bar','foo','foo','bar','foo','foo']
df = pd.DataFrame({'a':a, 'b':b})
df

Out[5]: 
     a    b
0   12  foo
1   34  bar
2  674  bar
3    2  foo
4    0  foo
5    5  bar
6    6  foo
7    8  foo

In [8]: df.loc[df['b']=='bar','a'].sum()
Out[8]: 713

Here we build a dict from your two lists and pass it as the data argument to the DataFrame constructor:

df = pd.DataFrame({'a':a, 'b':b})

Then we mask the DataFrame with loc, selecting the rows where column 'b' equals 'bar' and keeping only column 'a', and call sum() on the result:

df.loc[df['b']=='bar','a'].sum()
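As a rough sketch of what that one-liner does, the steps can be split apart. The intermediate names mask and selected are introduced here only for illustration.

```python
import pandas as pd

a = [12, 34, 674, 2, 0, 5, 6, 8]
b = ['foo', 'bar', 'bar', 'foo', 'foo', 'bar', 'foo', 'foo']
df = pd.DataFrame({'a': a, 'b': b})

# Step 1: a boolean Series, True for rows whose 'b' value is 'bar'.
mask = df['b'] == 'bar'
print(mask.tolist())      # [False, True, True, False, False, True, False, False]

# Step 2: loc keeps the rows where the mask is True, and only column 'a'.
selected = df.loc[mask, 'a']
print(selected.tolist())  # [34, 674, 5]

# Step 3: sum the selected values.
total = selected.sum()
print(total)              # 713
```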

Upvotes: 0

Chris Adams

Reputation: 18647

With numpy:

import numpy as np

a = np.array(a)
b = np.array(b)

a[b == 'bar'].sum()
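If the same lookup is needed for several labels, the masking step could be wrapped in a small helper. This is only a sketch: the function name sum_for_label and its shape check are hypothetical additions, not part of the answer.

```python
import numpy as np

def sum_for_label(values, labels, target):
    """Sum the entries of `values` whose matching entry in `labels` equals `target`."""
    values = np.asarray(values)
    labels = np.asarray(labels)
    if values.shape != labels.shape:
        raise ValueError("values and labels must have the same shape")
    # labels == target is a boolean array used to pick entries of values.
    return values[labels == target].sum()

a = [12, 34, 674, 2, 0, 5, 6, 8]
b = ['foo', 'bar', 'bar', 'foo', 'foo', 'bar', 'foo', 'foo']

print(sum_for_label(a, b, 'bar'))  # 713
print(sum_for_label(a, b, 'foo'))  # 28
```

A label that never occurs selects nothing, so the helper returns 0 for it.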

Upvotes: 4

U13-Forward

Reputation: 71570

Use zip to keep the values of a whose label in b is 'bar' (call sum(l) to get the total):

l = [x for x,y in zip(a,b) if y == 'bar']

If you want indexes:

l = [i for (i,x),y in zip(enumerate(a),b) if y == 'bar']
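For completeness, a minimal pure-Python sketch that goes all the way to the sum the question asks for, either directly or via the indexes. The names total and idx are illustrative.

```python
a = [12, 34, 674, 2, 0, 5, 6, 8]
b = ['foo', 'bar', 'bar', 'foo', 'foo', 'bar', 'foo', 'foo']

# Sum directly with a generator expression, without building a list first.
total = sum(x for x, y in zip(a, b) if y == 'bar')
print(total)  # 713

# Or collect the indexes of 'bar' first, then sum the values at those indexes.
idx = [i for i, y in enumerate(b) if y == 'bar']
print(idx)                     # [1, 2, 5]
print(sum(a[i] for i in idx))  # 713
```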

Upvotes: 0
