Reputation: 328
Suppose there is a dataframe defined as
import re
import pandas as pd

df = pd.DataFrame({'Col_1': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', '0'],
                   'Col_2': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', '0']})
which looks like
Col_1 Col_2
0 A a
1 B b
2 C c
3 D d
4 E e
5 F f
6 G g
7 H h
8 I i
9 J j
10 0 0
I would like to replace the values in Col_1 by using a dictionary defined as
repl_dict = {re.compile('[ABH-LP-Z]'): 'DDD',
             re.compile('[CDEFG]'): 'BBB WTT',
             re.compile('[MNO]'): 'AAA WTT',
             re.compile('[0-9]'): 'CCC'}
I would expect to get a new dataframe in which Col_1 looks as follows
Col_1
0 DDD
1 DDD
2 BBB WTT
3 BBB WTT
4 BBB WTT
5 BBB WTT
6 BBB WTT
7 DDD
8 DDD
9 DDD
10 CCC
I simply used df['Col_1'].replace(repl_dict, regex=True). However, it does not produce what I expected. What I got instead is:
Col_1
0 BBB WTTBBB WTTBBB WTT
1 BBB WTTBBB WTTBBB WTT
2 BBB WTT
3 BBB WTT
4 BBB WTT
5 BBB WTT
6 BBB WTT
7 BBB WTTBBB WTTBBB WTT
8 BBB WTTBBB WTTBBB WTT
9 BBB WTTBBB WTTBBB WTT
10 CCC
I would appreciate it very much if anyone could tell me why df.replace() is not working for me, and what the correct way is to replace multiple values to get the expected output.
Upvotes: 1
Views: 4315
Reputation: 103
A more realistic scenario is one where you want to reclassify entries based on a pattern.
Consider a dataframe 'x' as follows:
column
0 good farmer
1 bad farmer
2 ok farmer
3 worker did wrong
4 worker fired
5 worker hired
6 heavy duty work
7 light duty work
Then consider the following code:
x['column_reclassified'] = x['column'].replace(
    to_replace=[
        '^.*(farmer).*$',
        '^.*(worker).*$',
        '^.*(duty).*$'
    ],
    value=[
        'FARMER',
        'WORKER',
        'DUTY'
    ],
    regex=True
)
and it will produce the following output:
column column_reclassified
0 good farmer FARMER
1 bad farmer FARMER
2 ok farmer FARMER
3 worker did wrong WORKER
4 worker fired WORKER
5 worker hired WORKER
6 heavy duty work DUTY
7 light duty work DUTY
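If interaction between the patterns is a concern, one alternative is to classify each value once and stop at the first pattern that matches. The following is only a minimal sketch using the standard re module: the helper classify and the RULES list are hypothetical names, not part of the code above. With pandas, such a function could be applied through x['column'].map(classify).

```python
import re

# Hypothetical rule table: (compiled pattern, label), checked in order.
RULES = [
    (re.compile(r'farmer'), 'FARMER'),
    (re.compile(r'worker'), 'WORKER'),
    (re.compile(r'duty'), 'DUTY'),
]

def classify(value, rules=RULES):
    """Return the label of the first pattern found in value, else value itself."""
    for pattern, label in rules:
        if pattern.search(value):
            return label
    return value

entries = [
    'good farmer', 'bad farmer', 'ok farmer',
    'worker did wrong', 'worker fired', 'worker hired',
    'heavy duty work', 'light duty work',
]

# Each entry is classified exactly once, so a label is never re-matched.
reclassified = [classify(entry) for entry in entries]
print(reclassified)
```

Because each value is inspected only once, a label such as 'WORKER' can never be picked up again by a later pattern.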
Hope this also helps.
Upvotes: 0
Reputation: 43169
Use anchors (that is, ^ and $):
repl_dict = {re.compile('^[ABH-LP-Z]$'): 'DDD',
             re.compile('^[CDEFG]$'): 'BBB WTT',
             re.compile('^[MNO]$'): 'AAA WTT',
             re.compile('^[0-9]+$'): 'CCC'}
With df['Col_1'].replace(repl_dict, regex=True), this produces:
0 DDD
1 DDD
2 BBB WTT
3 BBB WTT
4 BBB WTT
5 BBB WTT
6 BBB WTT
7 DDD
8 DDD
9 DDD
10 CCC
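The output in the question is consistent with the patterns being applied one after another, each to the result of the previous one: 'A' first becomes 'DDD', and the unanchored [CDEFG] then matches each of the three D characters. The sketch below models that chaining with the standard re module only. It is not pandas' actual implementation, apply_in_order is a hypothetical helper, and whether a given pandas version chains replacements this way is an assumption.

```python
import re

def apply_in_order(value, repl):
    """Apply every (pattern, replacement) pair in turn to the running result.

    Hypothetical model of chained replacement, not pandas' real code.
    """
    for pattern, replacement in repl.items():
        value = pattern.sub(replacement, value)
    return value

# Patterns from the question: they match anywhere in the string.
unanchored = {re.compile('[ABH-LP-Z]'): 'DDD',
              re.compile('[CDEFG]'): 'BBB WTT',
              re.compile('[MNO]'): 'AAA WTT',
              re.compile('[0-9]'): 'CCC'}

# Anchored patterns: they match only a whole single-character value.
anchored = {re.compile('^[ABH-LP-Z]$'): 'DDD',
            re.compile('^[CDEFG]$'): 'BBB WTT',
            re.compile('^[MNO]$'): 'AAA WTT',
            re.compile('^[0-9]+$'): 'CCC'}

chained = apply_in_order('A', unanchored)  # 'DDD' is rewritten again by [CDEFG]
fixed = apply_in_order('A', anchored)      # 'DDD' no longer matches ^[CDEFG]$
print(chained)
print(fixed)
```

With the anchors in place, a replacement value such as 'DDD' is longer than one character, so none of the later patterns can match it.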
Upvotes: 3