Reputation: 135
I have text in a column I am cleaning, I need to remove all words between the words "Original" and "Subject" wherever they appear in the column which is only some of the rows.
I am currently trying
a = df['textcol']
import re
df['textcol'] =re.sub('Original.*?Subject','',str(a), flags=re.DOTALL)
this function is making every string within every tow the exact same as the first row it instead of looking at each row individually and altering it
Upvotes: 4
Views: 1478
Reputation: 626804
You need to use Series.str.replace
directly:
df['textcol'] = df['textcol'].str.replace(r'(?s)Original.*?Subject', '', regex=True)
Here, (?s)
stands for re.DOTALL
/ re.S
in order not to have to import re
, it is their inline modifier version. The .*?
matches any zero or more chars, as few as possible.
If Original
and Subject
need to be passed as variables containing literal text, do not forget about re.escape
:
import re
# ... etc. ...
start = "Original"
end = "Subject"
df['textcol'] = df['textcol'].str.replace(fr'(?s){re.escape(start)}.*?{re.escape(end)}', '', regex=True)
Upvotes: 4