Pandas remove text after a set of keywords in the column

Question

I have a pandas data frame which looks like as shown below

df = pd.DataFrame({'person_id': [11,11,11,11,11,11],
                   'details':['DOSE: 667 mg - TDS with food - Inject','DOSE: 667 mg - TDS with food - PO PRN Test','DOSE: 667 mg - TDS with food - Tube','DOSE: 667 mg - TDS with food - Inject  hypo gdksajgfkjsagf > 4 test value here','DOSE: 667 mg - TDS with food - Tube START 10:00 AM','DOSE: 667 mg - TDS with food - PO QTY removed']})

I would like to remove part of the text from the details column based on below logic

a) If the text value contains keywords such as PRN, Hypo, START or QTY removed, then remove everything that comes after those keywords. The keywords could be lower or upper case as well

I was trying something like below

df['text'].str.rsplit(r'prn|hypo|start|qty removed', 1).str.get(0) # this doesn't work

I expect my output to be like as shown below. Please note that my datatset has 4-5 million records, so any efficient and elegant approach would really be helpful

Wiktor Stribiżew · Accepted Answer

You can use Series.str.replace rather than rsplit:

df['details'] = df['details'].str.replace(r'(?i)\s*\b(prn|hypo|start|qty removed).*', '')
# >>> df['details'].str.replace(r'(?i)\s*\b(prn|hypo|start|qty removed).*', '')
# 0    DOSE: 667 mg - TDS with food - Inject
# 1        DOSE: 667 mg - TDS with food - PO
# 2      DOSE: 667 mg - TDS with food - Tube
# 3    DOSE: 667 mg - TDS with food - Inject
# 4      DOSE: 667 mg - TDS with food - Tube
# 5        DOSE: 667 mg - TDS with food - PO
# Name: details, dtype: object

See the regex demo. Details:

(?i) - an inline modifier making regex search case insensitive
\s* - zero or more whitespace chars
\b - a whole word marker, word boundary
(prn|hypo|start|qty removed) - a group of alternative strings (capturing, you may turn it into a non-capturing by replacing ( with (?:)
.* - the rest of the line (any zero or more chars other than line break chars, as many as possible).

Pandas remove text after a set of keywords in the column

Answers (2)

Related Questions