LBedo
LBedo

Reputation: 157

python - removing whitespaces not working - encoding issue?

I have a dataset which contains data scraped from a real estate agency. I want to clean the variables from special characters (€, m², etc), remove whitespaces and convert them into integers. For instance I have these for the price (Prix de vente) and size (Surface habitable):

enter image description here

I manage to strip the "€" from the price variable with the following:

data["Prix de vente"] = data["Prix de vente"].str.strip("€")

However none of these two techniques work to remove the whitespaces

data["Prix de vente"] = data["Prix de vente"].str.strip() 
data["Prix de vente"] = data["Prix de vente"].replace(" ","")

I keep having:

      540 000
      900 000
      890 000
      900 000
      900 000
      ...

I have a feeling this is due to the way my data is encoded. Would that be the issue? And if so, how do I deal with it? I'm using a jupyter notebook.

Upvotes: 1

Views: 811

Answers (2)

Peter
Peter

Reputation: 56

You can do

data["Prix de vente"] = data["Prix de vente"].str.replace("\s","", regex=True)

Matching "\s" combined with regex=True instead of " " makes sure to match not only regular spaces but any whitespace character

More background:

Your first approach

data["Prix de vente"] = data["Prix de vente"].str.strip()

does not work, since .strip() only removes leading and trailing characters.

Your second approach

data["Prix de vente"] = data["Prix de vente"].replace(" ","")

does not work, since it is using not str.replace, but pd.Series.replace, which only replaces exact matches of values (e.g. "900 000").

Upvotes: 4

Giacomo Catenazzi
Giacomo Catenazzi

Reputation: 9533

There is the usual way (but ugly):

data["Prix de vente"] = "".join(data["Prix de vente"].str.split())

How it works?

The string method split() will split the string on white spaces. Generic white spaces, so not just " ", but also "\t", "\u00a0", etc. Then you join again the substrings, without any spaces. So you remove the spaces (in not a much intuitive way, but it such method is frequently used, sometimes just to normalize spaces).

Why this works and but not your solution?

Some locales write numbers and currencies with NBSP (non breaking space, \u00A0), instead of standard spaces, which is more sensible (avoiding line break bewteen digit groups). Also some formatters may use other kind of spaces (e.g. small spaces) for such cases, so the solution should care about handling all different white spaces.

Upvotes: 0

Related Questions