Reputation: 147
I have datasets consisting of hundreds of thousands of rows and hundreds of columns. All columns should be float64, but a few of them are recognized as object when I import the file. I suspect they contain some words, but I don't know exactly which ones; it could be "deleted", "com fail", or whatever string a previous user typed in. How do I find the strings in the rows where a float should be, so I can remove them and convert the columns to a float dtype?
Ex:
df
date        sensor1  sensor2  sensor3  sensor4
01/01/2020      100       65       66       89
01/02/2020       94  deleted       99      100
01/03/2020       89       77    hello       55
The strings might be easy to spot by eye if I had a few hundred rows, but with millions of rows, what is the best practice for finding them?
Thank you.
Upvotes: 0
Views: 804
Reputation: 862511
To convert non-numeric values to missing values, use to_numeric with errors='coerce' inside DataFrame.apply to process all columns:
df1 = df.apply(pd.to_numeric, errors='coerce')
print (df1)
            sensor1  sensor2  sensor3  sensor4
date
01/01/2020      100     65.0     66.0       89
01/02/2020       94      NaN     99.0      100
01/03/2020       89     77.0      NaN       55
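If the end goal is clean float64 columns, here is a minimal sketch of how the coerced frame can be used further (this goes beyond the answer above, and the sample DataFrame is my own reconstruction of the question's data):
import pandas as pd

# hypothetical reconstruction of the question's data
df = pd.DataFrame(
    {
        "sensor1": [100, 94, 89],
        "sensor2": ["65", "deleted", "77"],
        "sensor3": ["66", "99", "hello"],
        "sensor4": [89, 100, 55],
    },
    index=pd.Index(["01/01/2020", "01/02/2020", "01/03/2020"], name="date"),
)

# coerce every column to numeric; any leftover string becomes NaN
df1 = df.apply(pd.to_numeric, errors="coerce")

# either drop the rows that contained a string ...
clean = df1.dropna()

# ... or keep them as NaN and just make sure the dtype is float64
df1 = df1.astype("float64")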
To find rows with non-numeric values (assuming there are no missing values in the original data), test whether the converted data has a missing value in at least one column with DataFrame.isna and DataFrame.any, then filter with boolean indexing:
df2 = df[df.apply(pd.to_numeric, errors='coerce').isna().any(axis=1)]
print (df2)
            sensor1  sensor2 sensor3  sensor4
date
01/02/2020       94  deleted      99      100
01/03/2020       89       77   hello       55
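A possible extension (not part of the original answer, just a sketch using the same coercion trick) is to pull out the exact offending strings with a boolean mask; the & df.notna() term guards against genuine missing values in the original data:
# True only where a cell holds something that cannot be parsed as a number
mask = df.apply(pd.to_numeric, errors='coerce').isna() & df.notna()

# keep only the flagged cells, then stack into a (date, column) -> value Series
bad_values = df.where(mask).stack()
print (bad_values)
print (bad_values.unique())   # for the sample data: ['deleted' 'hello']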
Upvotes: 1