Reputation: 452
I have a 100000000x2 array named "a", with an index in the first column and a related value in the second column. I need to get the median of the values in the second column for each index. This is how I could do it with a for statement:
import numpy as np

b = np.zeros(1000000)
a = np.array([[1, 2],
              [1, 3],
              [2, 3],
              [2, 4],
              [2, 6],
              [1, 4],
              ...
              [1000000, 6]])
for i in xrange(1, 1000001):
    b[i - 1] = np.median(a[np.where(a[:, 0] == i), 1])
Obviously this is much too slow with the for loop: any suggestions? Thanks
Upvotes: 6
Views: 253
Reputation: 8975
This is a little bit annoying to do, but at least you can easily get rid of that == by sorting (and that's probably your speed killer). Squeezing out more is probably not very useful, though it might be possible if you do more of the work yourself, etc.:
# First sort the whole thing (there are probably other ways):
sorter = np.argsort(a[:, 0])  # sort by class.
a = a[sorter]                 # sorted version of a

# Now we need to find where the class changes:
w = np.where(a[:-1, 0] != a[1:, 0])[0] + 1  # where the class changes.

# For simplicity, prepend 0 and append len(a) to get full slices:
w = np.concatenate(([0], w, [len(a)]))

result = np.zeros(len(w) - 1, dtype=float)  # medians can be fractional
for i in xrange(len(w) - 1):
    result[i] = np.median(a[w[i]:w[i+1], 1])

# If the classes are not exactly 1, 2, ..., N, we can add class information:
classes = a[w[:-1], 0]
If all your classes are the same size, so there are exactly as many 1s as 2s, etc., there are better ways still; see the sketch below.
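As a minimal sketch of that idea (my illustration, not part of the original answer; it assumes a known, fixed group size k, which the question does not guarantee), sorting plus a reshape removes the loop entirely:
import numpy as np

# Hypothetical data where every class occurs exactly k = 3 times.
a = np.array([[1, 2], [2, 3], [1, 3], [2, 4], [2, 6], [1, 4]])
k = 3

order = np.argsort(a[:, 0])          # sort by class
values = a[order, 1].reshape(-1, k)  # one row per class
medians = np.median(values, axis=1)  # vectorized median per class
classes = a[order, 0].reshape(-1, k)[:, 0]
print(medians)  # [ 3.  4.]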
EDIT: Check Bitwise's version for a solution that avoids the last for loop as well (he also hides some of this code in np.unique, which you may prefer, since speed should not matter much for that part anyway).
Upvotes: 4
Reputation: 7807
Here is my version, with no for loop and no additional modules. The idea is to sort the array once; after that you can get the position of each median just by counting the indices in the first column of a:
# sort by first column and then by second
b = a[np.lexsort((a[:, 1], a[:, 0]))]

# find where each index's block starts (and append the total length)
c = np.unique(b[:, 0], return_index=True)[1]
d = np.r_[c, len(a)]

# inds is the (possibly fractional) position of the median for each key
inds = (d[1:] + d[:-1] - 1) / 2.0

# final result: average the two central values (as suggested by seberg)
medians = np.mean(np.c_[b[np.floor(inds).astype(int), 1],
                        b[np.ceil(inds).astype(int), 1]], 1)
You can shorten the code if you like.
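For illustration (this worked example is mine, not from the original answer), here it is end to end on the small sample from the question:
import numpy as np

a = np.array([[1, 2],
              [1, 3],
              [2, 3],
              [2, 4],
              [2, 6],
              [1, 4]])

b = a[np.lexsort((a[:, 1], a[:, 0]))]
c = np.unique(b[:, 0], return_index=True)[1]
d = np.r_[c, len(a)]
inds = (d[1:] + d[:-1] - 1) / 2.0
medians = np.mean(np.c_[b[np.floor(inds).astype(int), 1],
                        b[np.ceil(inds).astype(int), 1]], 1)

print(medians)  # [ 3.  4.] -- median 3 for index 1, median 4 for index 2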
Upvotes: 3
Reputation: 19205
A quick 1-line approach:
result = [np.median(a[a[:,0]==ii,1]) for ii in np.unique(a[:,0])]
I'm not convinced there's much you can do to make that go faster without sacrificing accuracy. But here's another attempt, which might be faster if you can skip the sort step:
num_in_ind = np.bincount(a[:, 0])
# note: this picks the upper-middle element of even-sized groups
# instead of averaging the two central values
results = [np.sort(a[a[:, 0] == ii, 1])[num_in_ind[ii] // 2]
           for ii in np.unique(a[:, 0])]
The latter is very slightly faster for small arrays. Not sure if it's fast enough.
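If you want to check the speed claim yourself, a rough timing harness might look like this (my sketch; the array size and index range are made up, so the numbers will vary):
import numpy as np
import timeit

# hypothetical test data: 100000 rows, indices 0..999
a = np.random.randint(0, 1000, size=(100000, 2))

def with_median():
    return [np.median(a[a[:, 0] == ii, 1]) for ii in np.unique(a[:, 0])]

def with_sort():
    num_in_ind = np.bincount(a[:, 0])
    return [np.sort(a[a[:, 0] == ii, 1])[num_in_ind[ii] // 2]
            for ii in np.unique(a[:, 0])]

print(timeit.timeit(with_median, number=10))
print(timeit.timeit(with_sort, number=10))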
Upvotes: 1
Reputation: 251355
If you find yourself wanting to do this a lot, I would recommend you look at the pandas library, which makes this as easy as pie:
>>> df = pandas.DataFrame([["A", 1], ["B", 2], ["A", 3], ["A", 4], ["B", 5]], columns=["One", "Two"])
>>> print df
  One  Two
0   A    1
1   B    2
2   A    3
3   A    4
4   B    5
>>> df.groupby('One').median()
     Two
One
A    3.0
B    3.5
Upvotes: 2
Reputation: 114791
This is known as a "group by" operation. Pandas (http://pandas.pydata.org/) is a good tool for this:
import numpy as np
import pandas as pd
a = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [2.0, 5.0],
              [2.0, 6.0],
              [2.0, 8.0],
              [1.0, 4.0],
              [1.0, 1.0],
              [1.0, 3.5],
              [5.0, 8.0],
              [2.0, 1.0],
              [5.0, 9.0]])
# Create the pandas DataFrame.
df = pd.DataFrame(a, columns=['index', 'value'])
# Form the groups.
grouped = df.groupby('index')
# `result` is the DataFrame containing the aggregated results.
result = grouped.aggregate(np.median)
print result
Output:
       value
index
1        3.0
2        5.5
5        8.5
There are ways to create the DataFrame containing the original data directly, so you don't necessarily have to create the numpy array a
first.
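For example, a minimal sketch (mine, not from the original answer; the filename and column names are hypothetical) that reads the two columns straight from a CSV file:
import pandas as pd

# hypothetical file: each line is "index,value"
df = pd.read_csv('data.csv', names=['index', 'value'])
result = df.groupby('index').median()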
More information about the groupby operation in Pandas: http://pandas.pydata.org/pandas-docs/dev/groupby.html
Upvotes: 6