Reputation: 7703
I have a pandas data frame which looks like as shown below
df = pd.DataFrame({'person_id': [11,11,11,11,11,11],
'details':['DOSE: 667 mg - TDS with food - Inject','DOSE: 667 mg - TDS with food - PO PRN Test','DOSE: 667 mg - TDS with food - Tube','DOSE: 667 mg - TDS with food - Inject hypo gdksajgfkjsagf > 4 test value here','DOSE: 667 mg - TDS with food - Tube START 10:00 AM','DOSE: 667 mg - TDS with food - PO QTY removed']})
I would like to remove part of the text from the details
column based on below logic
a) If the text value contains keywords such as PRN
, Hypo
, START
or QTY removed
, then remove everything that comes after those keywords. The keywords could be lower or upper case as well
I was trying something like below
df['text'].str.rsplit(r'prn|hypo|start|qty removed', 1).str.get(0) # this doesn't work
I expect my output to be like as shown below. Please note that my datatset has 4-5 million records, so any efficient and elegant approach would really be helpful
Upvotes: 1
Views: 525
Reputation: 626804
You can use Series.str.replace
rather than rsplit
:
df['details'] = df['details'].str.replace(r'(?i)\s*\b(prn|hypo|start|qty removed).*', '')
# >>> df['details'].str.replace(r'(?i)\s*\b(prn|hypo|start|qty removed).*', '')
# 0 DOSE: 667 mg - TDS with food - Inject
# 1 DOSE: 667 mg - TDS with food - PO
# 2 DOSE: 667 mg - TDS with food - Tube
# 3 DOSE: 667 mg - TDS with food - Inject
# 4 DOSE: 667 mg - TDS with food - Tube
# 5 DOSE: 667 mg - TDS with food - PO
# Name: details, dtype: object
See the regex demo. Details:
(?i)
- an inline modifier making regex search case insensitive\s*
- zero or more whitespace chars\b
- a whole word marker, word boundary(prn|hypo|start|qty removed)
- a group of alternative strings (capturing, you may turn it into a non-capturing by replacing (
with (?:
).*
- the rest of the line (any zero or more chars other than line break chars, as many as possible).Upvotes: 1
Reputation: 30920
We could do Series.str.replace
with case = False
in order
not to distinguish between upper and lower case
df['details'] = df['details'].str.replace('prn|hypo|start|qty removed', '',
regex=True, case=False)
Upvotes: 0