Reputation: 87
I am trying to find the most Pythonic way to solve my problem in as little time as possible, since I am dealing with a large amount of data. My problem is the following:
I have two lists
a = [12,34,674,2,0,5,6,8]
b = ['foo','bar','bar','foo','foo','bar','foo','foo']
I want to tell Python: wherever 'bar' occurs in b, take those indexes and sum the values in list a at those indexes.
This is what I have done so far:
idx = [i for i, j in enumerate(b) if j == 'bar']
but then I am stuck. I am considering using some weird for loops. Do you have any ideas?
Upvotes: 2
Views: 1548
Reputation: 53029
Using np.bincount. This computes both sums ('foo' and 'bar').
import numpy as np
sum_foo, sum_bar = np.bincount(np.char.equal(b, 'bar'), a)
sum_foo
# 28.0
sum_bar
# 713.0
Note that np.char.equal works on both lists and arrays. If b is an array, then b == 'bar' can be used instead and is a bit faster.
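To make the mechanics of the bincount approach explicit, here is a minimal sketch using the lists from the question. The intermediate names (is_bar, bins) and the minlength=2 argument are additions for illustration, not part of the answer above; minlength=2 is assumed here so that two bins come back even if 'bar' never occurs.

```python
import numpy as np

a = [12, 34, 674, 2, 0, 5, 6, 8]
b = ['foo', 'bar', 'bar', 'foo', 'foo', 'bar', 'foo', 'foo']

# Boolean mask: True where the label is 'bar'
is_bar = np.asarray(b) == 'bar'

# Turn the mask into bin indices: 0 for 'foo' (everything else), 1 for 'bar'.
# The explicit cast is only here to make the binning visible.
bins = is_bar.astype(int)

# With weights, bincount adds a[i] into bin number bins[i].
# minlength=2 (an assumption of this sketch) guarantees two bins.
sum_foo, sum_bar = np.bincount(bins, weights=a, minlength=2)

print(sum_foo, sum_bar)
```

Because the weights are converted to floats, both results come back as floating-point numbers.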
Timings:
Even though this computes both sums it is actually pretty fast:
timeit(lambda: np.bincount(b == 'bar', a))
# 2.406161994993454
Compare for example with the numpy masking method:
timeit(lambda: a[b == 'bar'].sum())
# 5.642918559984537
On larger arrays masking becomes slightly faster, which is expected since bincount does essentially 2x the work. Still, bincount takes less than 2x the time, so if you happen to need both sums ('foo' and 'bar'), bincount is still faster.
aa = np.repeat(a, 1000)
bb = np.repeat(b, 1000)
timeit(lambda: aa[bb == 'bar'].sum(), number=1000)
# 0.07860603698645718
timeit(lambda: np.bincount(bb == 'bar', aa), number=1000)
# 0.11229897901648656
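For anyone who wants to repeat the comparison, the following is a rough, self-contained sketch of how the timings could be set up. It assumes a and b have already been converted to NumPy arrays and that timeit is the standard-library function; number=100 is an arbitrary choice for this sketch, and the measured times will differ from machine to machine.

```python
from timeit import timeit

import numpy as np

a = np.array([12, 34, 674, 2, 0, 5, 6, 8])
b = np.array(['foo', 'bar', 'bar', 'foo', 'foo', 'bar', 'foo', 'foo'])

# Larger inputs: every element repeated 1000 times
aa = np.repeat(a, 1000)
bb = np.repeat(b, 1000)

# Check that both methods agree before timing them
mask_sum = aa[bb == 'bar'].sum()
both_sums = np.bincount(bb == 'bar', aa)

# Time each approach; results depend on hardware and NumPy version
t_mask = timeit(lambda: aa[bb == 'bar'].sum(), number=100)
t_bincount = timeit(lambda: np.bincount(bb == 'bar', aa), number=100)

print(f"masking:  {t_mask:.4f} s")
print(f"bincount: {t_bincount:.4f} s")
```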
Upvotes: 3
Reputation: 393963
This is simple to do in pandas:
In [5]:
import pandas as pd
a = [12,34,674,2,0,5,6,8]
b = ['foo','bar','bar','foo','foo','bar','foo','foo']
df = pd.DataFrame({'a':a, 'b':b})
df
Out[5]:
     a    b
0   12  foo
1   34  bar
2  674  bar
3    2  foo
4    0  foo
5    5  bar
6    6  foo
7    8  foo
In [8]: df.loc[df['b']=='bar','a'].sum()
Out[8]: 713
So here we take your lists and construct a dict in place for the data argument of the DataFrame constructor:
df = pd.DataFrame({'a':a, 'b':b})
Then we just mask the df using loc, where we select the rows where 'b' == 'bar', select the column 'a', and call sum():
df.loc[df['b']=='bar','a'].sum()
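If the sums for every label are needed rather than just 'bar', one possible variation is to group by column 'b' instead of masking. This is a different technique from the loc mask above (a groupby), shown only as a sketch on the same data; the variable names are illustrative.

```python
import pandas as pd

a = [12, 34, 674, 2, 0, 5, 6, 8]
b = ['foo', 'bar', 'bar', 'foo', 'foo', 'bar', 'foo', 'foo']

df = pd.DataFrame({'a': a, 'b': b})

# Group the rows by label and sum column 'a' within each group
sums = df.groupby('b')['a'].sum()

# The result is a Series indexed by label
sum_bar = sums['bar']
sum_foo = sums['foo']

print(sum_bar, sum_foo)
```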
Upvotes: 0
Reputation: 18647
With numpy:
import numpy as np
a = np.array(a)
b = np.array(b)
a[b == 'bar'].sum()
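Broken into steps, the one-liner above does the following. This is just an expanded sketch of the same masking approach; the names mask, selected, and total are introduced here for illustration.

```python
import numpy as np

a = np.array([12, 34, 674, 2, 0, 5, 6, 8])
b = np.array(['foo', 'bar', 'bar', 'foo', 'foo', 'bar', 'foo', 'foo'])

# Element-wise comparison gives a boolean array, True where b is 'bar'
mask = b == 'bar'

# Boolean indexing keeps only the values of a at the True positions
selected = a[mask]

# Sum the selected values
total = selected.sum()

print(mask)
print(selected)
print(total)
```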
Upvotes: 4
Reputation: 71570
Use:
l = [x for x,y in zip(a,b) if y == 'bar']
If you want indexes:
l = [i for (i,x),y in zip(enumerate(a),b) if y == 'bar']
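The comprehensions above collect the matching values or indexes but stop short of the sum the question asks for. A minimal pure-Python sketch of that last step, using a generator expression so no intermediate list is built (the names total, idx, and total_from_idx are illustrative):

```python
a = [12, 34, 674, 2, 0, 5, 6, 8]
b = ['foo', 'bar', 'bar', 'foo', 'foo', 'bar', 'foo', 'foo']

# Sum the values of a whose paired label in b is 'bar'
total = sum(x for x, y in zip(a, b) if y == 'bar')

# Equivalent route through the indexes, as in the question's attempt
idx = [i for i, y in enumerate(b) if y == 'bar']
total_from_idx = sum(a[i] for i in idx)

print(total, idx, total_from_idx)
```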
Upvotes: 0