Reputation: 183
I have a large dataset from a CSV file. It has two columns: the first is Date/Time in hh:mm:ss:ms form and the other is Pressure in numeric form. The Pressure column has random values throughout that are not numerical (things like 150+AA42BB43). These appear randomly throughout the 50,000 rows in the file and are not all the same.
I need a way to change these Pressure values to numeric so I can perform data manipulation on them.
df_cleaned = df['Pressure'].loc[~df['Pressure'].map(lambda x: isinstance(x, float) | isinstance(x, int))]
I tried this, but it got rid of my Date/Time values, did not clean all the Pressure values, and also dropped my headers.
I was wondering if anyone had any suggestions on how I can easily clean the data in the 2nd column while also keeping my Date/Time values in the first column accurate.
Upvotes: 0
Views: 228
Reputation: 143097
Your problem is that you use
df_cleaned = df['Pressure']
and this selects only one column (Pressure) and skips the others. When you select a single column you get a Series instead of a DataFrame, and a Series holds only one column, so it has no column headers to select by.
You should run it without the ['Pressure'] selection, so the mask is applied to the whole DataFrame:
df_cleaned = df.loc[ ~df['Pressure'].map(...) ]
or even
df_cleaned = df[ ~df['Pressure'].map(...) ]
BTW: isinstance(x, (float, int)) is a shorter way to write both checks.
But using isinstance may not work if your float/int values are stored as strings, because isinstance("123", (float, int)) gives False. In that case you would rather have to try to convert with float("123") or int("123") and catch the error.
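If you go that route, a minimal sketch of such a converter could look like this (to_number is just an example helper name, not part of pandas, and df is the DataFrame from the example below):

def to_number(x):
    # Return x as a float if it converts cleanly, otherwise None
    try:
        return float(x)
    except (TypeError, ValueError):
        return None

# Keep only the rows whose Pressure converts to a number
df_cleaned = df[df['Pressure'].map(to_number).notna()]

The full example below sticks with isinstance and assumes the good values are already real numbers: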
import pandas as pd

data = {
    'DateTime': ['2021.10.04', '2021.10.05', '2021.10.06'],
    'Pressure': [78, '150+AA42BB43', 23],
}

df = pd.DataFrame(data)

# Keep only the rows where Pressure is already a real number
df_cleaned = df[df['Pressure'].map(lambda x: isinstance(x, (float, int)))]
print(df_cleaned)
Result:
     DateTime Pressure
0  2021.10.04       78
2  2021.10.06       23
EDIT:
If you have the values as strings then you can use to_numeric to convert them and put NaN where a value can't be converted
df['Pressure'] = pd.to_numeric(df['Pressure'], errors='coerce')
and then you can filter it using isna()
df_cleaned = df[ ~df['Pressure'].isna() ]
import pandas as pd

data = {
    'DateTime': ['2021.10.04', '2021.10.05', '2021.10.06'],
    'Pressure': ['78.2', '150+AA42BB43', '23'],
}

df = pd.DataFrame(data)

# Convert Pressure to numbers; values that can't be parsed become NaN
df['Pressure'] = pd.to_numeric(df['Pressure'], errors='coerce')
print(df)

# Drop the rows where the conversion failed
df_cleaned = df[~df['Pressure'].isna()]
print(df_cleaned)
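notna() is the positive counterpart of isna(), so the same filter can also be written without the ~:

# Equivalent: keep the rows where Pressure is not NaN
df_cleaned = df[df['Pressure'].notna()]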
Upvotes: 2
Reputation: 397
I think I have an answer if all your non-numerical values are strings.
Have you tried using pandas replace()? Something like:
df['Pressure'].replace(to_replace=r'.+', value=0, inplace=True, regex=True)
I've used a regex to match "any string". inplace=True means the existing dataframe is modified instead of creating a new one.
Here, the function will replace any string with a given integer. I am not sure which integer you'd like to put there, so I've just used zero as an example. If you'd like a different integer for each string, you could use a map as explained in this answer.
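For example, a minimal sketch of the dictionary form (the bad string and the tiny DataFrame here are just placeholders, not your real data):

import pandas as pd

df = pd.DataFrame({'Pressure': [78, '150+AA42BB43', 23]})

# Map each known bad string to the integer you want; values not listed stay as they are
df['Pressure'] = df['Pressure'].replace({'150+AA42BB43': 0})
print(df)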
Upvotes: 1