user1265125

Reputation: 2656

Removing outliers in each column (and corresponding row)

My NumPy array contains 10 columns and around 2 million rows.

Now I need to analyze each column separately, find the values that are outliers, and delete each corresponding row from the array.

So I'd start by analyzing column 0, find outliers at rows 10, 20, and 100, and remove those rows. Next I'd analyze column 1 in the now-trimmed array and apply the same process.

Of course I can think of the obvious manual process (iterate through each column, find the indices that are outliers, delete those rows, proceed to the next column), but I've always found that NumPy has some quick nifty tricks for statistical tasks like these.
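For concreteness, the manual version would be something like the sketch below (is_outlier here is a hypothetical stand-in for whatever per-column test is used):

import numpy as np

def trim_outliers_naive(arr, is_outlier):
    # Walk the columns left to right, deleting offending rows before
    # moving on, so each column is tested on the already-trimmed array.
    for j in range(arr.shape[1]):
        bad = is_outlier(arr[:, j])  # boolean mask for column j
        arr = np.delete(arr, np.flatnonzero(bad), axis=0)  # allocates a new array each pass
    return arr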

And if you could elaborate a bit on the runtime cost of the method, even better.

I'm not restricted to the NumPy library here, if SciPy has something helpful then no issues using it.

Thanks!

Upvotes: 3

Views: 3663

Answers (2)

Jaime

Reputation: 67507

Two very straightforward approaches, the second with a little more sophistication:

import numpy as np

arr = np.random.randn(2_000_000, 10)  # randn takes integer dimensions; 2e6 is a float

def remove_outliers(arr, k):
    # Keep only the rows whose every entry is within k sample standard
    # deviations of its column mean.
    mu, sigma = np.mean(arr, axis=0), np.std(arr, axis=0, ddof=1)
    return arr[np.all(np.abs((arr - mu) / sigma) < k, axis=1)]

def remove_outliers_bis(arr, k):
    mask = np.ones((arr.shape[0],), dtype=bool)  # np.bool was removed in NumPy 1.24
    mu, sigma = np.mean(arr, axis=0), np.std(arr, axis=0, ddof=1)
    for j in range(arr.shape[1]):
        col = arr[:, j]
        mask[mask] &= np.abs((col[mask] - mu[j]) / sigma[j]) < k
    return arr[mask]
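The trick in the second version is the in-place update mask[mask] &= ...: each column is tested only against the rows that have survived all previous columns, so the comparisons get cheaper as rows are discarded, much like the sequential trimming described in the question.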

Performance depends on how many outliers you have:

In [38]: %timeit remove_outliers(arr, 1)
1 loops, best of 3: 1.13 s per loop

In [39]: %timeit remove_outliers_bis(arr, 1)
1 loops, best of 3: 983 ms per loop

In [40]: %timeit remove_outliers(arr, 2)
1 loops, best of 3: 1.21 s per loop

In [41]: %timeit remove_outliers_bis(arr, 2)
1 loops, best of 3: 1.51 s per loop

And of course:

In [42]: np.allclose(remove_outliers(arr, 1), remove_outliers_bis(arr, 1))
Out[42]: True

In [43]: np.allclose(remove_outliers(arr, 2), remove_outliers_bis(arr, 2))
Out[43]: True

I would say that the complication of the second method does not justify its potential speed-up, but YMMV...

Upvotes: 4

DrV

Reputation: 23550

The best-performing solution depends on the relative costs of finding an outlier and deleting a row, and on the frequency of outliers.

If your outlier frequency is not very high, I would do as follows:

  • create a boolean table of outliers (one element for each element in the original table)
  • sum the table along axis 1 (one sum per row)
  • create a new table containing only the rows whose outlier sum is 0

Deleting rows one by one takes a lot of time, and if outlier-finding is not very expensive, the extra work from possibly finding several outliers in the same row is not significant.

As code, this would be something like:

outliers = find_outliers(data)
data_without_outliers = data[outliers.sum(axis=1) == 0]

where find_outliers creates a boolean table of outlier status (i.e. True if the corresponding element in the original array data is an outlier).

My guess is that the performance depends on your outlier-detection algorithm. If you can make it simple and vectorized, then this is fast.
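For instance, a minimal vectorized find_outliers using a k-sigma rule (just one possible criterion, chosen here for illustration) could look like:

import numpy as np

def find_outliers(data, k=3.0):
    # Boolean table, same shape as data: True where an element lies
    # more than k sample standard deviations from its column mean.
    mu = data.mean(axis=0)
    sigma = data.std(axis=0, ddof=1)
    return np.abs(data - mu) > k * sigma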

Upvotes: 0
