SeanK22

Reputation: 183

Pandas: Large CSV file data manipulation

I have a large dataset from a CSV file. It has two columns: the first is Date/Time in hh:mm:ss:ms form, and the second is Pressure as a number. The Pressure column has random values throughout that are not numeric (things like 150+AA42BB43). These appear at unpredictable points across the file's 50,000 rows and are never the same value twice.

I need a way to change these Pressure values to numeric so I can perform data manipulation on them.

df_cleaned = df['Pressure'].loc[~df['Pressure'].map(lambda x: isinstance(x, float) | isinstance(x, int))]

I tried this, but it dropped my Date/Time values and my headers, and it still did not clean all of the Pressure values.

I was wondering if anyone had suggestions on how I can easily clean the data in the second column while keeping the Date/Time values in the first column accurate.

Upvotes: 0

Views: 228

Answers (2)

furas

Reputation: 143097

Your problem is that you use

df_cleaned = df['Pressure']

which selects only one column (Pressure) and skips the others. Selecting a single column also gives you a Series instead of a DataFrame, and a Series holds only one column, so it has no column header to select by.

You should run it without ['Pressure']

df_cleaned = df.loc[ ~df['Pressure'].map(...) ]

or even

df_cleaned = df[ ~df['Pressure'].map(...) ]

BTW: isinstance(x, (float, int)) is shorter.

But using isinstance may not work if you have float/int values stored as strings, because isinstance("123", (float, int)) gives False. In that case you would instead have to try converting with float("123") or int("123") and catch the error.


import pandas as pd

data = {
    'DateTime': ['2021.10.04', '2021.10.05', '2021.10.06'], 
    'Pressure': [78, '150+AA42BB43', 23], 
}

df = pd.DataFrame(data)

df_cleaned = df[ df['Pressure'].map(lambda x:isinstance(x, (float, int))) ]

print(df_cleaned)

Result:

     DateTime Pressure
0  2021.10.04       78
2  2021.10.06       23
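The convert-and-catch idea mentioned above can be sketched like this (the sample data and the is_numeric helper name are made up for illustration):

```python
import pandas as pd

# Hypothetical sample: readings stored as strings, one corrupted entry
data = {
    'DateTime': ['2021.10.04', '2021.10.05', '2021.10.06'],
    'Pressure': ['78.2', '150+AA42BB43', '23'],
}
df = pd.DataFrame(data)

def is_numeric(x):
    # True if x converts cleanly to float, False otherwise
    try:
        float(x)
        return True
    except (TypeError, ValueError):
        return False

# keep only rows whose Pressure converts cleanly
df_cleaned = df[df['Pressure'].map(is_numeric)]
print(df_cleaned)
```

Unlike the isinstance check, this keeps rows like '78.2' that are valid numbers stored as strings.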

EDIT:

If you have the values as strings then you can use to_numeric to convert them; it puts NaN where a value can't be converted:

df['Pressure'] = pd.to_numeric(df['Pressure'], errors='coerce')

and then you can filter it using isna():

df_cleaned = df[ ~df['Pressure'].isna() ]

import pandas as pd

data = {
    'DateTime': ['2021.10.04', '2021.10.05', '2021.10.06'], 
    'Pressure': ['78.2', '150+AA42BB43', '23'], 
}

df = pd.DataFrame(data)

df['Pressure'] = pd.to_numeric(df['Pressure'], errors='coerce')
print(df)

df_cleaned = df[ ~df['Pressure'].isna() ]
print(df_cleaned)

Upvotes: 2

Maria

Reputation: 397

I think I have an answer, if all your non-numerical values are strings.

Have you tried using pandas replace()? Something like:

df['Pressure'].replace(to_replace = r'.+', value=0, inplace=True, regex=True)

I've used a regex that matches any non-empty string. inplace=True modifies the existing dataframe in place instead of creating a new one.

Here, the function will replace any string with a given integer. I am not sure which integer you'd like to put there, so I've just used zero as an example. If you'd like a different integer for each string, you could use a map as explained in this answer.
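For completeness, here is a minimal runnable sketch of that replace() call, on made-up sample data where valid readings are already numbers and only the corrupted entries are strings:

```python
import pandas as pd

# Made-up sample: valid readings are ints, corrupted entries are strings
df = pd.DataFrame({
    'DateTime': ['2021.10.04', '2021.10.05', '2021.10.06'],
    'Pressure': [78, '150+AA42BB43', 23],
})

# r'.+' matches any non-empty string; regex replacement only applies to
# string values, so the numeric readings are left untouched
df['Pressure'] = df['Pressure'].replace(to_replace=r'.+', value=0, regex=True)
print(df)
```

Assigning the result back (rather than using inplace=True on the column selection) sidesteps chained-assignment problems in newer pandas versions.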

Upvotes: 1
