Reputation: 2656
My NumPy array contains 10 columns and around 2 million rows.
Now I need to analyze each column separately, find values which are outliers, and delete the entire corresponding row from the array.
So I'd start by analyzing column 0, find outliers at rows 10, 20, and 100, and remove those rows. Next I'd analyze column 1 in the now-trimmed array and apply the same process.
Of course I can think of a normal manual process to do this (iterate through each column, find the indices which are outliers, delete those rows, proceed to the next column), but I've always found that NumPy contains some quick, nifty tricks to accomplish statistical tasks like these.
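For reference, a minimal sketch of that manual column-by-column process, assuming a k-sigma rule as the outlier test (the function name and the k=3.0 threshold are illustrative assumptions, not part of the question):

import numpy as np

def remove_outliers_naive(data, k=3.0):
    # Process one column at a time: flag outliers in the current column,
    # drop those rows, then test the next column on the trimmed array.
    for col in range(data.shape[1]):
        mu = data[:, col].mean()
        sigma = data[:, col].std(ddof=1)
        bad = np.abs(data[:, col] - mu) > k * sigma
        data = data[~bad]  # boolean mask instead of repeated np.delete calls
    return data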
And if you could elaborate a bit on the runtime cost of the method, even better.
I'm not restricted to the NumPy library here, if SciPy has something helpful then no issues using it.
Thanks!
Upvotes: 3
Views: 3663
Reputation: 67507
Two very straightforward approaches, the second with a little more sophistication:
import numpy as np

arr = np.random.randn(2_000_000, 10)  # randn needs integer sizes; 2e6 is a float
def remove_outliers(arr, k):
    # Keep only the rows whose z-score is below k in every column.
    mu, sigma = np.mean(arr, axis=0), np.std(arr, axis=0, ddof=1)
    return arr[np.all(np.abs((arr - mu) / sigma) < k, axis=1)]
def remove_outliers_bis(arr, k):
    mask = np.ones((arr.shape[0],), dtype=bool)  # np.bool was removed in NumPy 1.24
    mu, sigma = np.mean(arr, axis=0), np.std(arr, axis=0, ddof=1)
    for j in range(arr.shape[1]):
        col = arr[:, j]
        # Test only the rows that are still alive; the fancy-indexed
        # assignment writes the result back into the surviving positions.
        mask[mask] &= np.abs((col[mask] - mu[j]) / sigma[j]) < k
    return arr[mask]
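The mask[mask] &= ... line is the subtle part: boolean fancy indexing on the left-hand side writes the refined test back into only the still-surviving positions, so each column is tested only on rows that every earlier column accepted. A tiny runnable illustration with made-up values:

mask = np.array([True, False, True, True])
test = np.array([True, False, True])  # outcome for the 3 surviving rows
mask[mask] &= test
print(mask)  # [ True False False  True]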
Performance depends on how many outliers you have:
In [38]: %timeit remove_outliers(arr, 1)
1 loops, best of 3: 1.13 s per loop
In [39]: %timeit remove_outliers_bis(arr, 1)
1 loops, best of 3: 983 ms per loop
In [40]: %timeit remove_outliers(arr, 2)
1 loops, best of 3: 1.21 s per loop
In [41]: %timeit remove_outliers_bis(arr, 2)
1 loops, best of 3: 1.51 s per loop
And of course:
In [42]: np.allclose(remove_outliers(arr, 1), remove_outliers_bis(arr, 1))
Out[42]: True
In [43]: np.allclose(remove_outliers(arr, 2), remove_outliers_bis(arr, 2))
Out[43]: True
I would say that the complication of the second method does not justify its potential speed-up, but YMMV...
Upvotes: 4
Reputation: 23550
The best-performing solution depends on the relative cost of finding an outlier and deleting a row, and on how frequent outliers are.
If the outlier frequency is not very high, I would find all outliers first and then delete the flagged rows in a single pass, as in the code below.
Deleting rows one-by-one takes a lot of time, and if outlier-finding is not very expensive, the extra work caused by possibly flagging several outliers in the same row is insignificant.
As code, this would be something like:
outliers = find_outliers(data)  # boolean table, same shape as data
data_without_outliers = data[outliers.sum(axis=1) == 0]  # keep rows with no flagged cell
where find_outliers creates a boolean table of outlier status (i.e. True if the corresponding element in the original array data is an outlier).
My guess is that the performance depends on your outlier-detection algorithm. If you can make it simple and vectorized, then this is fast.
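For concreteness, a simple vectorized candidate for find_outliers is a per-column k-sigma test; the k parameter and its 3.0 default below are illustrative assumptions, not something the answer prescribes:

import numpy as np

def find_outliers(data, k=3.0):
    # Boolean table with the same shape as data: True where an element
    # lies more than k sample standard deviations from its column mean.
    mu = data.mean(axis=0)
    sigma = data.std(axis=0, ddof=1)
    return np.abs(data - mu) > k * sigma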
Upvotes: 0