Reputation: 793
Background
I have the following df
import pandas as pd
df = pd.DataFrame({'Text' : ['But the here is \nBase ID: 666666 \nDate is Here 123456 ',
'999998 For \nBase ID: 123456 \nDate there',
'So so \nBase ID: 939393 \nDate hey the 123455 ',],
'ID': [1,2,3],
'P_ID': ['A','B','C'],
})
Output
ID P_ID Text
0 1 A But the here is \nBase ID: 666666 \nDate is Here 123456
1 2 B 999998 For \nBase ID: 123456 \nDate there
2 3 C So so \nBase ID: 939393 \nDate hey the 123455
Tried
I have tried the following to replace the 6 digits between \nBase ID: and \nDate with **BLOCK**
df['New_Text'] = df['Text'].str.replace('ID:(.+?)','ID:**BLOCK**')
And I get the following
ID P_ID Text New_Text
0 But the here is \nBase ID:**BLOCK**666666 \nDate is Here 123456
1 999998 For \nBase ID:**BLOCK**123456 \nDate there
2 So so \nBase ID:**BLOCK**939393 \nDate hey the 123455
But I don't quite get what I want
Desired Output
ID P_ID Text New_Text
0 But the here is \nBase ID:**BLOCK** \nDate is Here 123456
1 999998 For \nBase ID:**BLOCK** \nDate there
2 So so \nBase ID:**BLOCK** \nDate hey the 123455
Question
How do I tweak the str.replace('ID:(.+?)','ID:**BLOCK**') part of my code to get my desired output?
Upvotes: 1
Views: 80
Reputation: 663
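You can try the code below to get your desired output. Pass regex=True explicitly, since pandas 2.0 and later no longer treat the pattern as a regex by default:
df['New_Text'] = df['Text'].str.replace(r'ID:\s+[0-9]+','ID:**BLOCK**', regex=True)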
Output:
0 But the here is \nBase ID:**BLOCK** \nDate is Here 123456
1 999998 For \nBase ID:**BLOCK** \nDate there
2 So so \nBase ID:**BLOCK** \nDate hey the 123455
Regex Breakdown:
'\s+' - matches one or more whitespace characters (here, the space after 'ID:')
'[0-9]+' - matches one or more digits (the 6-digit ID)
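For reference, here is a minimal self-contained sketch of the approach above, run against the DataFrame from the question. It assumes a pandas version that accepts the regex keyword, and it uses a raw string so that \s reaches the regex engine unchanged.

```python
import pandas as pd

df = pd.DataFrame({
    'Text': ['But the here is \nBase ID: 666666 \nDate is Here 123456 ',
             '999998 For \nBase ID: 123456 \nDate there',
             'So so \nBase ID: 939393 \nDate hey the 123455 '],
    'ID': [1, 2, 3],
    'P_ID': ['A', 'B', 'C'],
})

# regex=True is passed explicitly because the default became False in pandas 2.0.
# The pattern consumes "ID:", the whitespace after it, and the digits that follow.
df['New_Text'] = df['Text'].str.replace(r'ID:\s+[0-9]+', 'ID:**BLOCK**', regex=True)

print(df['New_Text'].tolist())
```

Only digits that directly follow 'ID:' are replaced, so the other 6-digit numbers in the text (such as the leading 999998 in the second row) are left alone.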
Upvotes: 1
Reputation: 420
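Try df['New_Text'] = df['Text'].str.replace('ID:(.+?)\n','ID:**BLOCK**\n', regex=True)
The lazy quantifier .+? matches the shortest possible string. With nothing after it in your pattern, that is just one character, the ' ' following 'ID:'. Adding \n to the pattern forces the match to run to the end of the line.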
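To see the difference between the two patterns, here is a small sketch using the standard re module on one row of the sample text. It is an illustration of the lazy-quantifier behaviour rather than the pandas call itself, and the variable names are made up for the example.

```python
import re

text = 'So so \nBase ID: 939393 \nDate hey the 123455 '

# Unanchored: .+? is satisfied by a single character, the space after "ID:",
# so the digits are left in place.
unanchored = re.sub(r'ID:(.+?)', 'ID:**BLOCK**', text)

# Anchored on the newline: .+? has to extend until the next \n,
# so the space, the digits, and the trailing space are all consumed.
anchored = re.sub(r'ID:(.+?)\n', 'ID:**BLOCK**\n', text)

print(repr(unanchored))
print(repr(anchored))
```

Note that the anchored version also swallows the space before \nDate, so its result differs from the desired output by that one character.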
Upvotes: 1