R A
R A

Reputation: 115

Replacing string in pandas python only if it matches the exact string

I'm having trouble getting a string in pandas to be replaced in the correct manner. I'm not sure if I'm limited to using pandas and there might not be a way to do this with solely using pandas.

This is how my dataframe looks:

 (ID: 10)              247333605                      0.0  
  (ID: 20)               36738870                      0.0  
  (ID: 40)             4668036427                      0.0  
  (ID: 50)             1918647972                      0.0  
  (ID: 60)             4323165902                  44125.0  
  (ID: 80)              145512255                      0.0  
 Assigned (ID: 30)       42050340                      0.0  
 Assigned (ID: 40)   130880371376                      0.0  
 Assigning (ID: 30)    1095844753                      0.0  
 Cancelled (ID: 40)        937280                      0.0  
 Cancelled (ID: 80)   16857720813                      0.0  
 Planned (ID: 20)      9060392597                      0.0  
 Planning (ID: 10)   108484297031                      0.0  
 Processed (ID: 70)  133289880880                      0.0  
 Revoked (ID: 50)      2411903072                      0.0  
 Writing (ID: 50)    146408550024                      0.0  
 Written (ID: 60)    139458227923                1018230.0  

For each (ID: x), it should be matched to the assigned (ID: x), cancelled (ID: x), etc with the correct ID.

Using lines similar to this line:

input_data['last_status'] = input_data.last_status.str.replace('(ID: 10)', 'Planning (ID: 10)')

My output is:

(Assigned (ID: 40))                                0.0  
  (Cancelled (ID: 80))                               0.0  
  (Planned (ID: 20))                                 0.0  
  (Planning (ID: 10))                                0.0  
  (Writing (ID: 50))                                 0.0  
  (Written (ID: 60))                             44125.0  
 Assigned (Assigned (ID: 40))                        0.0  
 Assigned (ID: 30)                                   0.0  
 Assigning (ID: 30)                                  0.0  
 Cancelled (Assigned (ID: 40))                       0.0  
 Cancelled (Cancelled (ID: 80))                      0.0  
 Planned (Planned (ID: 20))                          0.0  
 Planning (Planning (ID: 10))                        0.0  
 Processed (ID: 70)                                  0.0  
 Revoked (Writing (ID: 50))                          0.0  
 Writing (Writing (ID: 50))                          0.0  
 Written (Written (ID: 60))                    1018230.0  

As you can see, all the (ID: x) got replaced and it still doesn't match the correct term.

My ideal dataframe would look like this:

 Assigned (ID: 30)       42050340                      0.0  
 Assigned (ID: 40)   130880371376                      0.0  
 Assigning (ID: 30)    1095844753                      0.0  
 Cancelled (ID: 40)        937280                      0.0  
 Cancelled (ID: 80)   16857720813                      0.0  
 Planned (ID: 20)      9060392597                      0.0  
 Planning (ID: 10)   108484297031                      0.0  
 Processed (ID: 70)  133289880880                      0.0  
 Revoked (ID: 50)      2411903072                      0.0  
 Writing (ID: 50)    146408550024                      0.0  
 Written (ID: 60)    139458227923                1018230.0 

I'm bound to using pandas because the dataset is huge, I have a different implementation but they take me days to run. Is there a way to do this right in pandas?

I've never asked something before on stackoverflow. I hope my question is clear.

Upvotes: 3

Views: 5063

Answers (1)

cs95
cs95

Reputation: 402922

If you want to generalise, you can use str.replace with SOL/EOL anchors.

df['last_status'].str.replace(r'^(\(ID: \d+\))$', r'Planning: \1')

0     Planning: (ID: 10)
1     Planning: (ID: 20)
2     Planning: (ID: 40)
3     Planning: (ID: 50)
4     Planning: (ID: 60)
5     Planning: (ID: 80)
6      Assigned (ID: 30)
7      Assigned (ID: 40)
8     Assigning (ID: 30)
9     Cancelled (ID: 40)
10    Cancelled (ID: 80)
11      Planned (ID: 20)
12     Planning (ID: 10)
13    Processed (ID: 70)
14      Revoked (ID: 50)
15      Writing (ID: 50)
16      Written (ID: 60)
Name: last_status, dtype: object

If you want to replace a specific ID only, change your regex to -

r'^(\(ID: 10\))$'

Or,

r'^(\(ID: {}\))$'.format(number)

Where number is a variable that holds the ID value to perform replacement on.

Upvotes: 3

Related Questions