Tom S
Tom S

Reputation: 631

Not all cells changed by pandas.to_numeric

I have a dataframe with 30 columns and 1000 rows. It is read using:

df = pd.read_excel(filepath, sheet_name = "sheetname", header = 2)

Within the excel floats are comma seperated, but since they are recognized by excel as numbers this is not a problem. Unfortunatelty some cells need more attantion.

In order to understand the data I printed all string values.

print(df['Column_name'][df['Column_name'].apply(lambda x: type(x) == str)])

returns:

862     10-12,5
863     10-12,5
864     10-12,5
865     10-12,5
866     10-12,5
867     10-12,5
868     10-12,5
1129      8-12
1130      8-12
1131      8-12
1132      8-12
1133      8-12
Name: Column_name, dtype: object

Adding the following lines returns:

df['Column_name'] = df['Column_name'].str.split('-').str[-1]
df['Column_name'] = pd.to_numeric(df['Column_name'], errors='ignore')
print(df['Column_name'][df['Column_name'].apply(lambda x: type(x) == str)])

still returns

862       12,5
863       12,5
864       12,5
865       12,5
866       12,5
867       12,5
868       12,5
1129        12
1130        12
1131        12
1132        12
1133        12
Name: Column_name, dtype: object

Why are they still strings? I do understand why the ones still containing a comma have not changed, but the other ones I do not get.

Also I tried, using

df['Column_name'] = df['Column_name'].apply(lambda x: str(x).replace(',','.'))

But it messes up all the values that are already floats and everything becomes nan

Upvotes: 1

Views: 139

Answers (2)

jezrael
jezrael

Reputation: 862511

Because is used errors='ignore' in to_numeric - if there is error values are returned with no converting.

If 'ignore', then invalid parsing will return the input.

So use errors='coerce' for missing values if cannot convert with replace , to . first:

df['Column_name'] = df['Column_name'].str.split('-').str[-1].str.replace(",", ".")
df['Column_name'] = pd.to_numeric(df['Column_name'], errors='coerce')

EDIT:

If possible some trailing whitespaces first remove them:

df['Column_name'] = (df['Column_name'].astype(str)
                                      .str.strip()
                                      .str.split('-')
                                      .str[-1]
                                      .str.replace(",", "."))
df['Column_name'] = pd.to_numeric(df['Column_name'], errors='coerce')

Upvotes: 1

Andrej Kesely
Andrej Kesely

Reputation: 195418

If you have dataframe:

     Column_name
862      10-12,5
863      10-12,5
864      10-12,5
865      10-12,5
866      10-12,5
867      10-12,5
868      10-12,5
1129        8-12
1130        8-12
1131        8-12
1132        8-12
1133        8-12

Then:

df["Column_name"] = (
    df["Column_name"].str.split("-").str[-1].str.replace(",", ".").astype(float)
)
print(df)

Prints:

      Column_name
862          12.5
863          12.5
864          12.5
865          12.5
866          12.5
867          12.5
868          12.5
1129         12.0
1130         12.0
1131         12.0
1132         12.0
1133         12.0

Upvotes: 1

Related Questions