Alex W.

Reputation: 174

Wrangle messy data with Python pandas

I have a rather straightforward pandas DataFrame:

import pandas as pd

data = {'product': {0: 'car', 1: 'table', 2: 'computer'},
 'seller A': {0: 123.45, 1: 'x', 2: 3102.99},
 'seller B': {0: '3291,21 eur', 1: '391.11', 2: '1.239,99 Eur'}}

df = pd.DataFrame.from_dict(data)

Before working with the values, I have to bring them into a more regular format. I'm a complete novice in pandas, so I've been having a hard time trying to achieve this.

I have defined a function that is supposed to do the replacements and I have managed (I think) to apply it to the values in the two columns ('seller A' and 'seller B') but the result is not what I'd like.

def function(x):
    if isinstance(x,str):
        x = x.replace('eur','')
        x = x.replace('€','')
        x = x.replace(' Eur','')
    return(x)

for col in df.loc[:,'seller A':'seller B']:
    print(df[col].apply(function))

Intended result (table with floats or integers):

    product  seller A  seller B
0       car    123.45   3291.21
1     table       NaN    391.11
2  computer   3102.99   1239.99

Upvotes: 0

Views: 185

Answers (3)

Tom

Reputation: 8800

The tricky part for me here was handling commas vs. periods as the decimal separator. It looks like if there is a comma, it should be the decimal separator (rows 1 and 3), and periods should be treated as thousands separators (row 3). But if there isn't any comma, you can treat a period as the decimal separator if it is in the correct place (row 2).

So I changed your function to follow that logic: first remove the euro markers and any whitespace, then remove periods unless the first period sits in the decimal position (followed by exactly two characters), then replace commas with periods. I also use regex:

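import re

def function(x):
    if isinstance(x, str):
        # strip currency markers and whitespace
        x = re.sub(r'eur|Eur|€|\s', '', x)

        # drop periods unless the first one is in the decimal position
        if '.' in x and x.index('.') != len(x) - 3:
            x = re.sub(r'\.', '', x)

        # a remaining comma is the decimal separator
        x = re.sub(',', '.', x)

    return x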

You can then apply this function and use pd.to_numeric with errors='coerce' to force a numeric type. Strings like "x", which still don't look numeric after the function has been applied, become NaN:

for col in df.loc[:,'seller A':'seller B']:
    df[col] = pd.to_numeric(df[col].apply(function), errors='coerce')

Output:

    product  seller A  seller B
0       car    123.45   3291.21
1     table       NaN    391.11
2  computer   3102.99   1239.99

# df.dtypes
# product      object
# seller A    float64
# seller B    float64
# dtype: object

I'm guessing you might need to add more string replacement patterns if there are more cases to catch. pd.to_numeric(..., errors='coerce') will silently replace unconvertible strings with NaN, which can be handy, but in this case you should be careful about that. You could always remove the errors argument to make sure errors are raised, so that you can add more catches to your function (e.g. you would need to add something to handle the cell with 'x').
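
One way to keep an eye on what 'coerce' throws away is to compare the missing values before and after the conversion. This is only a minimal sketch: `cleaned` stands in for a column that has already been through the cleaning function, and both `cleaned` and `coerced_away` are names made up for illustration.

```python
import pandas as pd

# Hypothetical column after the cleaning function has been applied.
cleaned = pd.Series([123.45, 'x', '3102.99'], name='seller A')

numeric = pd.to_numeric(cleaned, errors='coerce')

# Values that were present before the conversion but are NaN afterwards
# are the ones that 'coerce' silently replaced.
coerced_away = cleaned[numeric.isna() & cleaned.notna()]
print(coerced_away.tolist())
```

Anything that shows up in `coerced_away` is a candidate for another replacement rule in the function.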

My main issue with this answer is that it checks each string individually, which is slow and doesn't leverage any of the pandas.Series.str methods. You might be able to use them if there is a single regex pattern that works for your data (which could get complicated).
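
For what it's worth, a column-wise version using the .str methods might look roughly like the sketch below. It is not equivalent to the function above: it uses a simpler rule (periods are dropped only when the value also contains a comma), so a value like '1.239' would be read as 1.239 rather than 1239. `clean_column` is a made-up name.

```python
import pandas as pd

def clean_column(s):
    """Hypothetical vectorized cleaner for one column of price-like values."""
    # Work on strings throughout; floats such as 123.45 become '123.45'.
    s = s.astype(str).str.replace(r'(?i)eur|€|\s', '', regex=True)

    # Assumption: a comma, when present, is the decimal separator, so any
    # periods in the same value are thousands separators.
    has_comma = s.str.contains(',', regex=False)
    european = s.str.replace('.', '', regex=False).str.replace(',', '.', regex=False)
    s = s.where(~has_comma, european)

    # Anything still not numeric (e.g. 'x') becomes NaN.
    return pd.to_numeric(s, errors='coerce')

df = pd.DataFrame({
    'product': ['car', 'table', 'computer'],
    'seller A': [123.45, 'x', 3102.99],
    'seller B': ['3291,21 eur', '391.11', '1.239,99 Eur'],
})

for col in ['seller A', 'seller B']:
    df[col] = clean_column(df[col])
```

On the sample data this should give the same numbers as the loop with pd.to_numeric, but check the comma rule against your real data before relying on it.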

Upvotes: 0

bsauce

Reputation: 632

You're missing two steps:

  1. Get rid of the thousands separators and use periods as the decimal separators. This is going to be pretty annoying given the variety of different formats your numbers are in, but a basic approach would be to use regex to a) check if there is punctuation in the third-to-last character of the number, b) if so, replace that character with a period, and c) remove all other punctuation in the string.

  2. Cast the string values to float. Just because a string contains only digits (e.g. "123") does not mean that it is of a numeric type. Here is an updated version of your function that casts to float (it still needs the punctuation cleaning from step 1, though):

    import numpy as np

    def function(x):
        if isinstance(x, str):
            x = x.replace(' eur', '')
            x = x.replace('€', '')
            x = x.replace(' Eur', '')
            try:
                # try to convert string to float
                x = float(x)
            except ValueError:
                # if string cannot be converted (e.g. if it is not a number), return missing
                x = np.nan
        return x
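
The punctuation cleaning from step 1 could look roughly like this. It is only a sketch under a few assumptions: `normalize_separators` is a made-up name, the currency markers have already been stripped, and every amount has either exactly two decimal digits or none (a value like '123.5' would be misread).

```python
import re

def normalize_separators(s):
    """Hypothetical helper: period as decimal separator, no thousands separators."""
    s = s.strip()
    # a) is the third-to-last character a comma or a period?
    if len(s) >= 3 and s[-3] in ',.':
        # c) drop punctuation from the integer part,
        # b) and rejoin the two decimal digits with a period
        return re.sub(r'[.,]', '', s[:-3]) + '.' + s[-2:]
    # No separator in the decimal position: treat all punctuation as thousands separators
    return re.sub(r'[.,]', '', s)

for raw in ['3291,21', '391.11', '1.239,99']:
    print(raw, '->', float(normalize_separators(raw)))
```

Calling this inside the function right before float(x) would cover the three formats in the question.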
    

Upvotes: 1

maow

Reputation: 2887

There are two things your function is missing:

  1. Convert ',' to '.'
  2. Convert string to float

Number 1 is analogous to what you already did. To convert to numeric, you can use e.g. pd.to_numeric(). However, if you have a string like "car" that cannot be converted, it raises a ValueError. You can ignore this with a try/except block.

In [17]: def function(x):
    ...:     if isinstance(x,str):
    ...:         x = x.replace('eur','')
    ...:         x = x.replace('€','')
    ...:         x = x.replace(' Eur','')
    ...:         x = x.replace(',', '.')
    ...:         try:
    ...:             x = pd.to_numeric(x)
    ...:         except ValueError:
    ...:             pass
    ...:     return(x)
    ...:

In [18]: df.applymap(function)
Out[18]: 
    product seller A  seller B
0       car    123.5      3291
1     table        x     391.1
2  computer     3103  1.239.99

In the end I used applymap() to replace the loop and return a DataFrame directly. To make this more performant you should vectorize the function and apply it to complete columns. However, as you also have strings like "x" in the column seller A this is probably the easiest way.

Upvotes: 1
