Reputation: 174
I have a rather straightforward pandas DataFrame:
import pandas as pd

data = {'product': {0: 'car', 1: 'table', 2: 'computer'},
        'seller A': {0: 123.45, 1: 'x', 2: 3102.99},
        'seller B': {0: '3291,21 eur', 1: '391.11', 2: '1.239,99 Eur'}}
df = pd.DataFrame.from_dict(data)
Before working with the values, I have to bring them into a more regular format. I'm a complete novice in pandas, so I've been having a hard time trying to achieve this.
I have defined a function that is supposed to do the replacements, and I have managed (I think) to apply it to the values in the two columns ('seller A' and 'seller B'), but the result is not what I'd like.
def function(x):
    if isinstance(x,str):
        x = x.replace('eur','')
        x = x.replace('€','')
        x = x.replace(' Eur','')
    return(x)

for col in df.loc[:,'seller A':'seller B']:
    print(df[col].apply(function))
Intended result (table with floats or integers):
    product   seller A  seller B
0   car         123.45   3291.21
1   table          NaN    391.11
2   computer   3102.99   1239.99
Upvotes: 0
Views: 185
Reputation: 8800
The tricky part for me here was handling the commas vs. periods for the decimal. It looks like if there is a comma, it should be the decimal separator (row 1/3), and periods should be treated as thousands separators (row 3). But if there isn't any comma, you can treat a period as the decimal separator if it is in the correct place (row 2).
So I made your function represent that logic by first removing the Euro symbols, then removing periods unless the first one is followed by exactly two digits, then replacing commas with periods. I also use regex:
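import re

def function(x):
    if isinstance(x, str):
        # strip currency markers and whitespace
        x = re.sub(r'eur|Eur|€|\s', '', x)
        # a period that is not third from the end is a thousands separator
        if '.' in x and x.index('.') != len(x) - 3:
            x = re.sub(r'\.', '', x)
        # any remaining comma is the decimal separator
        x = re.sub(',', '.', x)
    return x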
You can then apply this function, and use pd.to_numeric with the 'coerce' option to force a numeric type, including for strings like "x" which still don't look numeric after function was applied:
for col in df.loc[:,'seller A':'seller B']:
    df[col] = pd.to_numeric(df[col].apply(function), errors='coerce')
Output:
    product   seller A  seller B
0   car         123.45   3291.21
1   table          NaN    391.11
2   computer   3102.99   1239.99
# df.dtypes
# product      object
# seller A    float64
# seller B    float64
# dtype: object
I'm guessing you might need to add more string replacement patterns if there are more cases to catch. pd.to_numeric(..., errors='coerce') will silently replace unconvertible strings with NaN, which can be handy, but in this case you should be careful about that. You could always remove the errors argument to make sure errors are raised, so that you can add more catches to your function (e.g. you would need to add something to handle the cell with 'x').
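To illustrate the difference between the two modes, here is a minimal sketch on a made-up Series (the values are only for illustration, not your real data). With errors='coerce' the bad cell quietly becomes NaN; with the default behaviour the conversion stops with a ValueError, which is what tells you a pattern is still missing:

```python
import pandas as pd

# Illustrative values only: two clean numbers and one unparseable cell.
s = pd.Series(["3291.21", "x", "1239.99"])

# errors='coerce' turns anything unparseable into NaN without complaint.
coerced = pd.to_numeric(s, errors="coerce")

# The default (errors='raise') stops at the first bad value instead.
try:
    pd.to_numeric(s)
    raised = False
except ValueError as exc:
    raised = True
    print(f"conversion failed: {exc}")
```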
My main issue with this answer is that it checks each string individually, which is slow and doesn't leverage any of the pandas.Series.str methods. You could maybe do so if there were a single regex pattern that works for your data (which could get complicated).
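For what it's worth, here is a rough sketch of what a Series.str version might look like. The helper name clean_column is made up, and the rule is a slight variation of the one above: a period is dropped unless it is followed by exactly two digits at the end of the string. It assumes every decimal part has two digits, so treat it as a starting point rather than a general solution. Cells that are already numbers are left alone, because the .str methods return NaN for non-strings and those gaps are filled back in from the original column.

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["car", "table", "computer"],
    "seller A": [123.45, "x", 3102.99],
    "seller B": ["3291,21 eur", "391.11", "1.239,99 Eur"],
})

def clean_column(col: pd.Series) -> pd.Series:
    """Hypothetical helper: normalise one price column to floats."""
    if pd.api.types.is_numeric_dtype(col):
        return col  # nothing to clean
    # .str methods yield NaN for cells that are not strings (e.g. floats).
    s = col.str.replace(r"(?i)eur|€|\s", "", regex=True)
    # Drop periods unless they are followed by exactly two trailing digits.
    s = s.str.replace(r"\.(?!\d{2}$)", "", regex=True)
    # Whatever comma is left is treated as the decimal separator.
    s = s.str.replace(",", ".", regex=False)
    cleaned = pd.to_numeric(s, errors="coerce")
    # Put back the cells that were numeric to begin with.
    return cleaned.fillna(pd.to_numeric(col, errors="coerce"))

for name in ["seller A", "seller B"]:
    df[name] = clean_column(df[name])
```

Strings such as "x" still end up as NaN here, so the same caution about errors='coerce' applies.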
Upvotes: 0
Reputation: 632
You're missing two steps:
1. Get rid of the thousands separators and use periods as the decimal separators. This is going to be pretty annoying given the variety of different formats your numbers are in, but a basic approach would be to use regex to a) check if there is punctuation in the third-to-last character of the number, b) if so, replace that character with a period, and c) remove all other punctuation in the string.
2. Cast the string values to a float. Just because a string contains only numbers (e.g. "123") does not mean that it is of a numeric type. Here is an updated version of your function that casts to floats (it still needs the punctuation cleaning, though):
import numpy as np

def function(x):
    if isinstance(x,str):
        x = x.replace(' eur','')
        x = x.replace('€','')
        x = x.replace(' Eur','')
    try:
        # try to convert string to float
        x = float(x)
    except (ValueError, TypeError):
        # if string cannot be converted (e.g. if it is not a number), return missing
        x = np.nan
    return(x)
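The punctuation cleaning from step 1 could be sketched roughly like this. The helper name normalize_number is made up, and it assumes the currency text has already been stripped and that a decimal part, when present, always has exactly two digits (so a value like "123.5" would be mishandled):

```python
import re

def normalize_number(text: str) -> str:
    """Hypothetical helper: turn '1.239,99' into '1239.99'."""
    text = text.strip()
    # a) is the third-to-last character a period or a comma?
    if len(text) >= 3 and text[-3] in ".,":
        integer_part, decimals = text[:-3], text[-2:]
        # b) use a period as the decimal separator,
        # c) and drop every other separator from the integer part
        return re.sub(r"[.,]", "", integer_part) + "." + decimals
    # No decimal separator in that position: strip all separators.
    return re.sub(r"[.,]", "", text)

print(normalize_number("3291,21"))   # 3291.21
print(normalize_number("1.239,99"))  # 1239.99
```

The result is still a string, so it would be called inside function right before the float(x) conversion.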
Upvotes: 1
Reputation: 2887
There are two things your function is missing:
1. Replacing ',' with '.'
2. Converting the values to a numeric type
Number 1. is analogous to what you did. To convert to numeric, you can e.g. use pd.to_numeric(). However, if you have a string like "car" which cannot be converted, it throws a ValueError. You can ignore this with a try-except block.
In [17]: def function(x):
    ...:     if isinstance(x,str):
    ...:         x = x.replace('eur','')
    ...:         x = x.replace('€','')
    ...:         x = x.replace(' Eur','')
    ...:         x = x.replace(',', '.')
    ...:     try:
    ...:         x = pd.to_numeric(x)
    ...:     except ValueError:
    ...:         pass
    ...:     return(x)
    ...:
In [18]: df.applymap(function)
Out[18]:
product seller A seller B
0 car 123.5 3291
1 table x 391.1
2 computer 3103 1.239.99
In the end I used applymap() to replace the loop and return a DataFrame directly.
To make this more performant you should vectorize the function and apply it to complete columns. However, as you also have strings like "x" in the column seller A, this is probably the easiest way.
Upvotes: 1