Not all cells changed by pandas.to_numeric

Question

I have a dataframe with 30 columns and 1000 rows. It is read using:

df = pd.read_excel(filepath, sheet_name = "sheetname", header = 2)

Within the excel floats are comma seperated, but since they are recognized by excel as numbers this is not a problem. Unfortunatelty some cells need more attantion.

In order to understand the data I printed all string values.

print(df['Column_name'][df['Column_name'].apply(lambda x: type(x) == str)])

returns:

862     10-12,5
863     10-12,5
864     10-12,5
865     10-12,5
866     10-12,5
867     10-12,5
868     10-12,5
1129      8-12
1130      8-12
1131      8-12
1132      8-12
1133      8-12
Name: Column_name, dtype: object

Adding the following lines returns:

df['Column_name'] = df['Column_name'].str.split('-').str[-1]
df['Column_name'] = pd.to_numeric(df['Column_name'], errors='ignore')
print(df['Column_name'][df['Column_name'].apply(lambda x: type(x) == str)])

still returns

862       12,5
863       12,5
864       12,5
865       12,5
866       12,5
867       12,5
868       12,5
1129        12
1130        12
1131        12
1132        12
1133        12
Name: Column_name, dtype: object

Why are they still strings? I do understand why the ones still containing a comma have not changed, but the other ones I do not get.

Also I tried, using

df['Column_name'] = df['Column_name'].apply(lambda x: str(x).replace(',','.'))

But it messes up all the values that are already floats and everything becomes nan

jezrael · Accepted Answer

Because is used errors='ignore' in to_numeric - if there is error values are returned with no converting.

If 'ignore', then invalid parsing will return the input.

So use errors='coerce' for missing values if cannot convert with replace , to . first:

df['Column_name'] = df['Column_name'].str.split('-').str[-1].str.replace(",", ".")
df['Column_name'] = pd.to_numeric(df['Column_name'], errors='coerce')

EDIT:

If possible some trailing whitespaces first remove them:

df['Column_name'] = (df['Column_name'].astype(str)
                                      .str.strip()
                                      .str.split('-')
                                      .str[-1]
                                      .str.replace(",", "."))
df['Column_name'] = pd.to_numeric(df['Column_name'], errors='coerce')

Not all cells changed by pandas.to_numeric

Answers (2)

Related Questions