Reputation: 631
I have a dataframe with 30 columns and 1000 rows. It is read using:
df = pd.read_excel(filepath, sheet_name = "sheetname", header = 2)
Within the excel floats are comma seperated, but since they are recognized by excel as numbers this is not a problem. Unfortunatelty some cells need more attantion.
In order to understand the data I printed all string values.
print(df['Column_name'][df['Column_name'].apply(lambda x: type(x) == str)])
returns:
862 10-12,5
863 10-12,5
864 10-12,5
865 10-12,5
866 10-12,5
867 10-12,5
868 10-12,5
1129 8-12
1130 8-12
1131 8-12
1132 8-12
1133 8-12
Name: Column_name, dtype: object
Adding the following lines returns:
df['Column_name'] = df['Column_name'].str.split('-').str[-1]
df['Column_name'] = pd.to_numeric(df['Column_name'], errors='ignore')
print(df['Column_name'][df['Column_name'].apply(lambda x: type(x) == str)])
still returns
862 12,5
863 12,5
864 12,5
865 12,5
866 12,5
867 12,5
868 12,5
1129 12
1130 12
1131 12
1132 12
1133 12
Name: Column_name, dtype: object
Why are they still strings? I do understand why the ones still containing a comma have not changed, but the other ones I do not get.
Also I tried, using
df['Column_name'] = df['Column_name'].apply(lambda x: str(x).replace(',','.'))
But it messes up all the values that are already floats and everything becomes nan
Upvotes: 1
Views: 139
Reputation: 862511
Because is used errors='ignore'
in to_numeric
- if there is error values are returned with no converting.
If 'ignore', then invalid parsing will return the input.
So use errors='coerce'
for missing values if cannot convert with replace ,
to .
first:
df['Column_name'] = df['Column_name'].str.split('-').str[-1].str.replace(",", ".")
df['Column_name'] = pd.to_numeric(df['Column_name'], errors='coerce')
EDIT:
If possible some trailing whitespaces first remove them:
df['Column_name'] = (df['Column_name'].astype(str)
.str.strip()
.str.split('-')
.str[-1]
.str.replace(",", "."))
df['Column_name'] = pd.to_numeric(df['Column_name'], errors='coerce')
Upvotes: 1
Reputation: 195418
If you have dataframe:
Column_name
862 10-12,5
863 10-12,5
864 10-12,5
865 10-12,5
866 10-12,5
867 10-12,5
868 10-12,5
1129 8-12
1130 8-12
1131 8-12
1132 8-12
1133 8-12
Then:
df["Column_name"] = (
df["Column_name"].str.split("-").str[-1].str.replace(",", ".").astype(float)
)
print(df)
Prints:
Column_name
862 12.5
863 12.5
864 12.5
865 12.5
866 12.5
867 12.5
868 12.5
1129 12.0
1130 12.0
1131 12.0
1132 12.0
1133 12.0
Upvotes: 1