andreaconsole

Reputation: 452

dealing with arrays: how to avoid a "for" statement

I have a 100000000x2 array named "a", with an index in the first column and a related value in the second column. I need the median of the second-column values for each index. This is how I could do it with a for statement:

import numpy as np
b = np.zeros(1000000)
a = np.array([[1, 2],
              [1, 3],
              [2, 3],
              [2, 4],
              [2, 6],
              [1, 4],
              ...
              ...
              [1000000,6]])
for i in range(1000000):
    # the indices run from 1 to 1000000, so slot i holds the median for index i+1
    b[i] = np.median(a[a[:,0] == i+1, 1])

Obviously it's too slow with the for iteration: any suggestions? Thanks

Upvotes: 6

Views: 253

Answers (5)

seberg

Reputation: 8975

This is a little bit annoying to do, but at least you can easily get rid of that annoying == by sorting (and that's probably your speed killer). Trying more is probably not very useful, though it might be possible if you sort yourself, etc.:

# First sort the whole thing (there are probably other ways):
sorter = np.argsort(a[:,0]) # sort by class.
a = a[sorter] # sorted version of a

# Now we need to find where there are changes in the class:
w = np.where(a[:-1,0] != a[1:,0])[0] + 1 # Where the class changes.
# for simplicity, append [0] and [len(a)] to have full slices...
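w = np.concatenate(([0], w, [len(a)])) # concatenate takes a single sequence of arrays
result = np.zeros(len(w)-1, dtype=float) # float, since a median can be fractional
for i in range(len(w)-1):
    result[i] = np.median(a[w[i]:w[i+1], 1]) # median of the value column only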

# If the classes are not exactly 1, 2, ..., N we could add class information:
classes = a[w[:-1],0]

If all your classes are the same size (so there are exactly as many 1s as 2s, etc.), there are better ways, though.
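For the equal-size case, one possible approach (a sketch, not necessarily what the author had in mind) is to sort by class and reshape the value column so that each row holds one class. The toy data and the name `group_size` are assumptions made for illustration:

```python
import numpy as np

# Hypothetical toy data: three classes, each appearing exactly four times.
a = np.array([[1, 2], [2, 3], [3, 9], [1, 4],
              [2, 5], [3, 1], [1, 6], [2, 7],
              [3, 3], [1, 8], [2, 1], [3, 5]])

group_size = 4  # assumed to be known and identical for every class

# A stable sort by class puts each class's rows in one contiguous block.
order = np.argsort(a[:, 0], kind="stable")
sorted_a = a[order]

# One row per class, one column per member of that class.
values = sorted_a[:, 1].reshape(-1, group_size)
classes = sorted_a[::group_size, 0]

# A single vectorised call computes every class's median.
medians = np.median(values, axis=1)
print(classes, medians)
```

The reshape raises an error if the number of rows is not a multiple of `group_size`, but it cannot detect classes of unequal size that happen to add up, so this only applies when the equal-size assumption really holds.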

EDIT: Check Bitwise's version for a solution that avoids the last for loop as well (he also hides some of this code in np.unique, which you may prefer, since speed should not matter for that part anyway).

Upvotes: 4

Bitwise

Reputation: 7807

Here is my version, no for and no additional modules. The idea is to sort the array once and then you can easily get the indices of the medians just by counting the indices in the first column of a:

# sort by first column and then by second
b=a[np.lexsort((a[:,1],a[:,0]))]

# find central value for each index
c=np.unique(b[:,0],return_index=True)[1]
d=np.r_[c,len(a)]
inds=(d[1:]+d[:-1]-1)/2.0
# final result (as suggested by seberg)
medians=np.mean(np.c_[b[np.floor(inds).astype(int),1],
                      b[np.ceil(inds).astype(int),1]],1)

# inds is the index of the median value for each key

You can shorten the code if you like.
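One way the shortening might look is sketched below. The function name `group_medians` and the toy array are made up for illustration; the floor/ceil step is replaced by integer division, which gives the same two middle positions:

```python
import numpy as np

def group_medians(a):
    """Return (keys, medians) for a 2-column array of (key, value) rows."""
    # Sort by key first, then by value within each key.
    b = a[np.lexsort((a[:, 1], a[:, 0]))]
    keys, starts = np.unique(b[:, 0], return_index=True)
    bounds = np.r_[starts, len(b)]
    # Sum of the first and last position of each group
    # (twice the position of the group's median).
    total = bounds[1:] + bounds[:-1] - 1
    lo = total // 2        # lower middle element
    hi = (total + 1) // 2  # upper middle element (equals lo for odd sizes)
    return keys, (b[lo, 1] + b[hi, 1]) / 2.0

# Toy data taken from the first rows shown in the question.
a = np.array([[1, 2], [1, 3], [2, 3], [2, 4], [2, 6], [1, 4]])
keys, medians = group_medians(a)
print(keys, medians)
```

For groups with an even number of rows, `lo` and `hi` differ by one and the result is the average of the two middle values, which matches what np.median returns.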

Upvotes: 3

keflavich

Reputation: 19205

A quick 1-line approach:

result = [np.median(a[a[:,0]==ii,1]) for ii in np.unique(a[:,0])]

I'm not convinced there's much you can do to make that go faster without sacrificing accuracy. But here's another attempt, which might be faster if you can skip the sort step:

num_in_ind = np.bincount(a[:,0])
results = [np.sort(a[a[:,0]==ii,1])[num_in_ind[ii]//2] for ii in np.unique(a[:,0])]

The latter is very slightly faster for small arrays. Not sure if it's fast enough.
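To make the accuracy trade-off concrete: the second snippet picks a single element of the sorted values, so for a group with an even number of values it returns the upper of the two middle values rather than their average. A small check on made-up data (the array below is hypothetical, and integer division is used so the index is an integer):

```python
import numpy as np

# Hypothetical toy data: key 1 has four values, key 2 has three.
a = np.array([[1, 2], [1, 3], [2, 3], [2, 4], [2, 6], [1, 4], [1, 10]])

keys = np.unique(a[:, 0])
counts = np.bincount(a[:, 0])  # counts[k] = number of rows with key k

# True median per key versus the "middle element after sorting" shortcut.
exact = [float(np.median(a[a[:, 0] == k, 1])) for k in keys]
shortcut = [int(np.sort(a[a[:, 0] == k, 1])[counts[k] // 2]) for k in keys]

print(exact)     # key 1 averages its two middle values
print(shortcut)  # key 1 takes the upper middle value instead
```

The two agree for key 2, which has an odd number of values, and differ only for key 1.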

Upvotes: 1

BrenBarn

Reputation: 251355

If you find yourself wanting to do this a lot, I would recommend you look at the pandas library, which makes this as easy as pie:

>>> import pandas
>>> df = pandas.DataFrame([["A", 1], ["B", 2], ["A", 3], ["A", 4], ["B", 5]], columns=["One", "Two"])
>>> print(df)
  One  Two
0   A    1
1   B    2
2   A    3
3   A    4
4   B    5
>>> df.groupby('One').median()
      Two
One     
A    3.0
B    3.5

Upvotes: 2

Warren Weckesser

Reputation: 114791

This is known as a "group by" operation. Pandas (http://pandas.pydata.org/) is a good tool for this:

import numpy as np
import pandas as pd

a = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [2.0, 5.0],
              [2.0, 6.0],
              [2.0, 8.0],
              [1.0, 4.0],
              [1.0, 1.0],
              [1.0, 3.5],
              [5.0, 8.0],
              [2.0, 1.0],
              [5.0, 9.0]])

# Create the pandas DataFrame.
df = pd.DataFrame(a, columns=['index', 'value'])

# Form the groups.
grouped = df.groupby('index')

# `result` is the DataFrame containing the aggregated results.
result = grouped.aggregate(np.median)
print(result)

Output:

       value
index       
1        3.0
2        5.5
5        8.5

There are ways to create the DataFrame containing the original data directly, so you don't necessarily have to create the numpy array a first.
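As one illustration of building the DataFrame directly, the sketch below passes plain Python lists as columns. The data is made up, and in practice it might come from a file (for example via pd.read_csv) rather than being typed in:

```python
import pandas as pd

# Hypothetical column data, supplied as plain lists instead of a numpy array.
df = pd.DataFrame({
    "index": [1, 1, 2, 2, 2, 1],
    "value": [2.0, 3.0, 5.0, 6.0, 8.0, 4.0],
})

# Median of `value` for each distinct `index`, returned as a Series.
medians = df.groupby("index")["value"].median()
print(medians)
```

Selecting the `value` column before calling median() returns a Series keyed by `index` instead of a one-column DataFrame; which is more convenient depends on what you do with the result next.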

More information about the groupby operation in Pandas: http://pandas.pydata.org/pandas-docs/dev/groupby.html

Upvotes: 6
