zebra

Reputation: 6531

How to remove all rows in a numpy.ndarray that contain non-numeric values

I read a dataset into a numpy.ndarray and some of the values are missing (either absent entirely, NaN, or the string "NA").

I want to clean out all rows containing any entry like this. How do I do that with a numpy ndarray?

Upvotes: 122

Views: 92764

Answers (2)

cottontail

Reputation: 23321

You can also use a masked array via np.ma.fix_invalid to create a mask and filter out "bad" values (such as NaN, inf).

import numpy as np

arr = np.array([
    [0, 1, np.inf],
    [2.2, 3.3, 4.],
    [np.nan, 5.5, 6],
    [7.8, -np.inf, 9.9],
    [10, 11, 12]
])

new_arr = arr[~np.ma.fix_invalid(arr).mask.any(axis=1)]

# array([[ 2.2,  3.3,  4. ],
#        [10. , 11. , 12. ]])
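
For a float array like this one, np.isfinite gives the same result without building a masked array; continuing with the arr defined above, a minimal equivalent sketch is:

# keep only rows where every entry is finite (not NaN, not +/-inf)
new_arr = arr[np.isfinite(arr).all(axis=1)]

# array([[ 2.2,  3.3,  4. ],
#        [10. , 11. , 12. ]])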

If the array contains strings such as 'NA', np.where can be used to replace those values with NaN and then filter them out.

arr = np.array([
    [0, 1, 'N/A'],
    [2.2, 3.3, 4.],
    [np.nan, 5.5, 6],
    [7.8, 'NA', 9.9],
    [10, 11, 12]
], dtype=object)

tmp = np.where(np.isin(arr, ['NA', 'N/A']), np.nan, arr).astype(float)
new_arr = tmp[~np.isnan(tmp).any(axis=1)]

# array([[ 2.2,  3.3,  4. ],
#        [10. , 11. , 12. ]])
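
If the data originates from a text file, np.genfromtxt can map the 'NA' strings to NaN while reading, so the bad rows can be dropped right after loading. A sketch, where the file name 'data.csv' and the comma delimiter are placeholders for the actual data source:

import numpy as np

# 'NA'/'N/A' fields (and empty fields) become NaN in the loaded float array
data = np.genfromtxt('data.csv', delimiter=',',
                     missing_values=['NA', 'N/A'], filling_values=np.nan)
clean = data[~np.isnan(data).any(axis=1)]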

Upvotes: 1

eumiro

Reputation: 213025

>>> import numpy as np
>>> a = np.array([[1, 2, 3], [4, 5, np.nan], [7, 8, 9]])
>>> a
array([[  1.,   2.,   3.],
       [  4.,   5.,  nan],
       [  7.,   8.,   9.]])

>>> a[~np.isnan(a).any(axis=1)]
array([[ 1.,  2.,  3.],
       [ 7.,  8.,  9.]])

and reassign the result to a if you want to keep only those rows.

Explanation: np.isnan(a) returns a boolean array of the same shape with True where the value is NaN and False elsewhere. .any(axis=1) reduces the m×n boolean array to m values by applying a logical OR across each row, ~ inverts True/False, and a[...] selects only the rows of the original array for which the expression inside the brackets is True.
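
Step by step, with the same a, the intermediate results are:

>>> np.isnan(a)
array([[False, False, False],
       [False, False,  True],
       [False, False, False]])
>>> np.isnan(a).any(axis=1)
array([False,  True, False])
>>> ~np.isnan(a).any(axis=1)
array([ True, False,  True])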

Upvotes: 198
