The Great
The Great

Reputation: 7703

Pandas remove text after a set of keywords in the column

I have a pandas data frame which looks like as shown below

df = pd.DataFrame({'person_id': [11,11,11,11,11,11],
                   'details':['DOSE: 667 mg - TDS with food - Inject','DOSE: 667 mg - TDS with food - PO PRN Test','DOSE: 667 mg - TDS with food - Tube','DOSE: 667 mg - TDS with food - Inject  hypo gdksajgfkjsagf > 4 test value here','DOSE: 667 mg - TDS with food - Tube START 10:00 AM','DOSE: 667 mg - TDS with food - PO QTY removed']})

I would like to remove part of the text from the details column based on below logic

a) If the text value contains keywords such as PRN, Hypo, START or QTY removed, then remove everything that comes after those keywords. The keywords could be lower or upper case as well

I was trying something like below

df['text'].str.rsplit(r'prn|hypo|start|qty removed', 1).str.get(0) # this doesn't work

I expect my output to be like as shown below. Please note that my datatset has 4-5 million records, so any efficient and elegant approach would really be helpful

enter image description here

Upvotes: 1

Views: 525

Answers (2)

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 626804

You can use Series.str.replace rather than rsplit:

df['details'] = df['details'].str.replace(r'(?i)\s*\b(prn|hypo|start|qty removed).*', '')
# >>> df['details'].str.replace(r'(?i)\s*\b(prn|hypo|start|qty removed).*', '')
# 0    DOSE: 667 mg - TDS with food - Inject
# 1        DOSE: 667 mg - TDS with food - PO
# 2      DOSE: 667 mg - TDS with food - Tube
# 3    DOSE: 667 mg - TDS with food - Inject
# 4      DOSE: 667 mg - TDS with food - Tube
# 5        DOSE: 667 mg - TDS with food - PO
# Name: details, dtype: object

See the regex demo. Details:

  • (?i) - an inline modifier making regex search case insensitive
  • \s* - zero or more whitespace chars
  • \b - a whole word marker, word boundary
  • (prn|hypo|start|qty removed) - a group of alternative strings (capturing, you may turn it into a non-capturing by replacing ( with (?:)
  • .* - the rest of the line (any zero or more chars other than line break chars, as many as possible).

Upvotes: 1

ansev
ansev

Reputation: 30920

We could do Series.str.replace with case = False in order not to distinguish between upper and lower case

df['details'] = df['details'].str.replace('prn|hypo|start|qty removed', '', 
                                          regex=True, case=False)

Upvotes: 0

Related Questions