Derek Eden

Reputation: 4628

How to determine if a numpy array or pandas row contains a string?

I have a pandas df of floats, but due to improper output/errors from the program from which I received the data, a number of rows contain values that are actually strings.

I want to remove these rows from the df with minimal looping. Ideally, I would like to mask the values in the df according to which are strings, and drop any row whose mask contains a True. Another option would be to iterate through each row, mask that row, and delete it if a True is in the mask. The worst case would be to loop over each row and also loop over each value to achieve the same task.

Can anyone advise how I could do this the most efficiently?

Something akin to df.iloc[x].istype(str), perhaps?

As a futile attempt I tried df.loc[row_num].contains(str), but it didn't work.

I know I can loop over every single cell and do isinstance(cell, str) to check whether it's a string, but I would really prefer some kind of masking technique.
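Roughly, the brute-force version I'm trying to avoid would look like this (untested sketch):

# naive approach: check every cell of every row with isinstance
bad_rows = [idx for idx in df.index
            if any(isinstance(cell, str) for cell in df.loc[idx])]
df = df.drop(index=bad_rows)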

As a side note, to narrow down any solutions: I don't want to fix any string values to be floats, I just want to delete the entire row.

Thanks in advance.

An example of a problematic row is below; notice the string with two decimal points:

df.loc[516].values

array([890.0, 33.17, 29.64, 78.355, 80.182, 83.196, 86.721,
       90.12299999999999, 92.807, '91.705.099', 98.89, 99.007,
       99.34200000000001, 99.337, 100.43799999999999, 99.867, '100.625',
       100.712, 100.46, 100.427, 101.16799999999999, 100.904, 100.193,
       100.255, 100.537, 100.37100000000001, 100.535, 100.584, 101.52,
       101.787, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan], dtype=object)

Upvotes: 0

Views: 1502

Answers (3)

Stef

Reputation: 30589

Using np.isreal and all we can select all rows where every element is real, i.e. an int or a float:

df[df.applymap(np.isreal).all(axis=1)]

Example:

df = pd.DataFrame({'a': [1,'2',3], 'b': [10,20,np.nan]})
df = df[df.applymap(np.isreal).all(axis=1)]

gives

   a     b
0  1  10.0
2  3   NaN

(caveat: this will also filter out complex numbers, although they are of course numeric)
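If you need to exclude complex numbers as well, an isinstance-based mask should work (untested sketch; NumPy's scalar types are registered with the numbers ABCs):

import numbers

# keep rows where every cell is a real number (NaN is a float, so it passes);
# strings and complex values both fail the isinstance check
df[df.applymap(lambda x: isinstance(x, numbers.Real)).all(axis=1)]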

Upvotes: 1

Andy L.

Reputation: 25249

Try map and check for type str:

df.loc[516].map(type).eq(str).any()

It will return True if any cell in row 516 is of type str.

If you want to check the whole df, just use applymap:

df.applymap(type).eq(str).any(axis=1)

It will return a boolean Series mask with True/False for each row.
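To actually drop those rows, as the question asks, just invert the mask, e.g.:

# keep only the rows where no cell is a string
df = df[~df.applymap(type).eq(str).any(axis=1)]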

Upvotes: 1

samredai

Reputation: 712

You could transpose the dataframe, then try to convert each column (which was originally a row) using pd.to_numeric(). If there is a parse error because of a string that cannot be converted to an int or float, it will throw a ValueError. You can catch this exception and drop that column. Something like this:

df_transposed = df.T

for col in df_transposed:
    try:
        # convert the column (originally a row) to a numeric dtype
        df_transposed[col] = pd.to_numeric(df_transposed[col])
    except ValueError:
        # the row contained an unparseable string, so drop it
        df_transposed = df_transposed.drop(columns=[col])

df = df_transposed.T
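If you'd rather avoid the loop, errors='coerce' can achieve the same thing in one pass (untested sketch):

# coerce every column; cells that fail to parse become NaN
coerced = df.apply(pd.to_numeric, errors='coerce')
# a cell that was non-NaN before coercion but NaN after held a bad string
failed = coerced.isna() & df.notna()
df = df[~failed.any(axis=1)]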

Upvotes: 2
