Reputation: 157
I have a dataset which contains data scraped from a real estate agency. I want to clean the variables from special characters (€, m², etc), remove whitespaces and convert them into integers. For instance I have these for the price (Prix de vente) and size (Surface habitable):
I manage to strip the "€" from the price variable with the following:
data["Prix de vente"] = data["Prix de vente"].str.strip("€")
However none of these two techniques work to remove the whitespaces
data["Prix de vente"] = data["Prix de vente"].str.strip()
data["Prix de vente"] = data["Prix de vente"].replace(" ","")
I keep having:
540 000
900 000
890 000
900 000
900 000
...
I have a feeling this is due to the way my data is encoded. Would that be the issue? And if so, how do I deal with it? I'm using a jupyter notebook.
Upvotes: 1
Views: 811
Reputation: 56
You can do
data["Prix de vente"] = data["Prix de vente"].str.replace("\s","", regex=True)
Matching "\s"
combined with regex=True
instead of " "
makes sure to match not only regular spaces but any whitespace character
More background:
Your first approach
data["Prix de vente"] = data["Prix de vente"].str.strip()
does not work, since .strip() only removes leading and trailing characters.
Your second approach
data["Prix de vente"] = data["Prix de vente"].replace(" ","")
does not work, since it is using not str.replace, but pd.Series.replace, which only replaces exact matches of values (e.g. "900 000").
Upvotes: 4
Reputation: 9533
There is the usual way (but ugly):
data["Prix de vente"] = "".join(data["Prix de vente"].str.split())
How it works?
The string method split()
will split the string on white spaces. Generic white spaces, so not just " "
, but also "\t"
, "\u00a0"
, etc.
Then you join again the substrings, without any spaces.
So you remove the spaces (in not a much intuitive way, but it such method is frequently used, sometimes just to normalize spaces).
Why this works and but not your solution?
Some locales write numbers and currencies with NBSP (non breaking space, \u00A0), instead of standard spaces, which is more sensible (avoiding line break bewteen digit groups). Also some formatters may use other kind of spaces (e.g. small spaces) for such cases, so the solution should care about handling all different white spaces.
Upvotes: 0