Stacey

Reputation: 5097

Standardize values in a data-frame column

I have a dataframe df which looks like:

id colour  response
 1   blue    curent 
 2    red   loaning
 3 yellow   current
 4  green      loan 
 5    red   currret
 6  green      loan

You can see that the values in the response column are not uniform, and I would like to get them to snap to a standardized set of responses.

I also have a validation list, validate, which looks like:

validate
 current
    loan
transfer

I would like to standardise the response column in df by matching the first three characters of each entry against the validate list.

So the eventual output would look like:

id colour  response
 1   blue   current
 2    red      loan
 3 yellow   current
 4  green      loan 
 5    red   current
 6  green      loan

I have tried to use fnmatch:

pattern = 'cur*'
fnmatch.filter(df, pattern) = 'current'

but I can't change the values in the df this way.

If anyone could offer assistance, it would be appreciated.

Thanks

Upvotes: 0

Views: 323

Answers (2)

BENY

Reputation: 323226

How about a fuzzy match?

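from fuzzywuzzy import process

# val is assumed to be a DataFrame holding the validation list in its 'validate' column.
# extractOne returns the best match as a tuple whose first element is the matched value.
df['response2'] = [process.extractOne(x, val.validate)[0] for x in df.response]
df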
Out[867]: 
   id  colour response response2
0   1    blue   curent   current
1   2     red  loaning      loan
2   3  yellow  current   current
3   4   green     loan      loan
4   5     red  currret   current
5   6   green     loan      loan
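
If installing fuzzywuzzy is not an option, a similar closest-match lookup can be sketched with the standard library's difflib. This is a different technique from the one above, so scores and edge cases may differ; the helper name standardize and the cutoff of 0.6 are illustrative assumptions, and the sample values are copied from the question.

```python
from difflib import get_close_matches

# Validation values and raw responses taken from the question.
validate = ["current", "loan", "transfer"]
responses = ["curent ", "loaning", "current", "loan ", "currret", "loan"]

def standardize(value, choices, cutoff=0.6):
    """Return the closest entry in choices, or the cleaned value if nothing is close enough.

    The name and the cutoff are hypothetical choices for this sketch.
    """
    cleaned = value.strip().lower()
    matches = get_close_matches(cleaned, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else cleaned

standardized = [standardize(r, validate) for r in responses]
```

With a DataFrame, the same helper could presumably be applied column-wise, for example df['response2'] = df.response.map(lambda r: standardize(r, validate)).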

Upvotes: 0

Zero

Reputation: 76917

You could use map with a dict keyed on the first three characters of each validation value:

In [3664]: mapping = dict(zip(s.str[:3], s))

In [3665]: df.response.str[:3].map(mapping)
Out[3665]:
0    current
1       loan
2    current
3       loan
4    current
5       loan
Name: response, dtype: object

In [3666]: df['response2'] = df.response.str[:3].map(mapping)

In [3667]: df
Out[3667]:
   id  colour response response2
0   1    blue   curent   current
1   2     red  loaning      loan
2   3  yellow  current   current
3   4   green     loan      loan
4   5     red  currret   current
5   6   green     loan      loan

Here s is the Series of validation values:

In [3650]: s
Out[3650]:
0     current
1        loan
2    transfer
Name: validate, dtype: object

Details

In [3652]: mapping
Out[3652]: {'cur': 'current', 'loa': 'loan', 'tra': 'transfer'}

mapping can be a Series too, as long as the three-character prefixes form its index:

In [3678]: pd.Series(s.values, index=s.str[:3].values)
Out[3678]:
cur     current
loa        loan
tra    transfer
dtype: object
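
The prefix lookup yields NaN for any response whose first three characters are not a key in mapping, and it is sensitive to leading whitespace and letter case. A minimal sketch of one way to overwrite the response column in place while keeping unmatched entries unchanged is shown here; the strip/lower cleaning step and the fillna fallback are assumptions added for illustration, not part of the answer above.

```python
import pandas as pd

# Sample data copied from the question (note the stray trailing spaces).
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "colour": ["blue", "red", "yellow", "green", "red", "green"],
    "response": ["curent ", "loaning", "current", "loan ", "currret", "loan"],
})
s = pd.Series(["current", "loan", "transfer"], name="validate")

# Prefix -> standard value, e.g. 'cur' -> 'current'.
mapping = dict(zip(s.str[:3], s))

# Normalise whitespace and case before taking the prefix (an added assumption).
cleaned = df["response"].str.strip().str.lower()

# Look up each prefix; fall back to the cleaned original where no prefix matches.
df["response"] = cleaned.str[:3].map(mapping).fillna(cleaned)
```

Assigning back to df["response"] replaces the column directly, which is what the question asks for; writing to a new column such as response2 instead keeps the raw values for comparison.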

Upvotes: 2
