Reputation: 29
Let's say I have a text file with the content below:
Quetiapine fumarate Drug substance This document
Povidone Binder USP
This line doesn't contain any medicine name.
This line contains Quetiapine fumarate which shouldn't be extracted as it not present at the
beginning of the line.
Dibasic calcium phosphate dihydrate Diluent USP is not present in the csv
Lactose monohydrate Diluent USNF
Magnesium stearate Lubricant USNF
Lactose monohydrate, CI 77491
0.6
Colourant
E 172
Some lines to break the group.
Silicon dioxide colloidal anhydrous
(0.004
Gliding agent
Ph Eur
Adding some random lines.
Povidone
(0.2
Lubricant
Ph Eur
I have a CSV file containing a list of medicine names. I want to match these names in the .txt file and extract all the data between two medicines, but only when the medicine name is at the beginning of a line. (Examples of medicines from the CSV file are 'Quetiapine fumarate', 'Povidone', 'Magnesium stearate', 'Lactose monohydrate', etc.)
I want to iterate over each line of my text file and create groups running from one medicine to the next.
A group should only start if the medicine name is at the start of a line, not if it appears in the middle of a line.
Expected output:
['Quetiapine fumarate Drug substance This document'],
['Povidone Binder USP'],
['Lactose monohydrate Diluent USNF'],
['Magnesium stearate Lubricant USNF'],
[Lactose monohydrate, CI 77491
0.6
Colourant
E 172],
[Povidone
(0.2
Lubricant
Ph Eur]
Can someone please help me do this in Python?
My attempt so far:
medicines = ('Quetiapine fumarate', 'Povidone', 'Magnesium stearate', 'Lactose monohydrate')
result = []
with open('C:/Users/test1.txt', 'r', encoding='utf8') as f:
    for line in f:
        if any(line.startswith(med) for med in medicines):
            result.append(line.strip())
This captures the output below, but I need the remaining groups as well:
['Quetiapine fumarate Drug substance This document'],
['Povidone Binder USP'],
['Lactose monohydrate Diluent USNF'],
['Magnesium stearate Lubricant USNF']
I need to capture all the text from one medicine to the next, as shown in the expected output. If a line starts with a medicine name and the next line holds a number, I need to capture that block as one group (four lines in total, as in the last two groups of the expected output).
Upvotes: 0
Views: 113
Reputation: 626689
You may use this regex with the re.M option:
^\s*(?:Quetiapine fumarate|Povidone|Magnesium stearate|Lactose monohydrate).*(?:\n[^\w\n]*\d*\.?\d+[^\w\n]*(?:\n.*){2})?
See the regex demo
Details
^ - start of a line
\s* - 0 or more whitespaces
(?:Quetiapine fumarate|Povidone|Magnesium stearate|Lactose monohydrate) - your list of medicines
.* - rest of the line
(?:\n[^\w\n]*\d*\.?\d+[^\w\n]*(?:\n.*){2})? - an optional string of
    \n - newline
    [^\w\n]* - 0+ chars other than word and newline chars
    \d*\.?\d+ - a number
    [^\w\n]* - 0+ chars other than word and newline chars
    (?:\n.*){2} - two occurrences of a newline and the rest of the line
Python (see Python demo online):
import re

medicines = ['Quetiapine fumarate', 'Povidone', 'Magnesium stearate', 'Lactose monohydrate']
# Escape each name and join them into one non-capturing alternation.
med = r"(?:{})".format("|".join(map(re.escape, medicines)))
pattern = re.compile(r"^\s*" + med + r".*(?:\n[^\w\n]*\d*\.?\d+[^\w\n]*(?:\n.*){2})?", re.M)
with open('C:/Users/test1.txt', 'r', encoding='utf8') as f:
    # The pattern has no capturing groups, so findall returns the full matches.
    result = pattern.findall(f.read())
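To check the pattern without touching the file system, here is a minimal sketch that runs the same regex against the sample text from the question held in a string. The helper names build_pattern and extract_groups are only illustrative, and the medicines list is a stand-in for whatever you load from your CSV. Each match is split into its lines so the result has the nested-list shape of the expected output.

```python
import re

# Stand-in for the names loaded from the CSV file (assumption for this sketch).
medicines = ["Quetiapine fumarate", "Povidone", "Magnesium stearate", "Lactose monohydrate"]

# The sample text from the question, kept in memory instead of read from disk.
sample_text = "\n".join([
    "Quetiapine fumarate Drug substance This document",
    "Povidone Binder USP",
    "This line doesn't contain any medicine name.",
    "This line contains Quetiapine fumarate which shouldn't be extracted as it not present at the",
    "beginning of the line.",
    "Dibasic calcium phosphate dihydrate Diluent USP is not present in the csv",
    "Lactose monohydrate Diluent USNF",
    "Magnesium stearate Lubricant USNF",
    "Lactose monohydrate, CI 77491",
    "0.6",
    "Colourant",
    "E 172",
    "Some lines to break the group.",
    "Silicon dioxide colloidal anhydrous",
    "(0.004",
    "Gliding agent",
    "Ph Eur",
    "Adding some random lines.",
    "Povidone",
    "(0.2",
    "Lubricant",
    "Ph Eur",
])


def build_pattern(names):
    """Compile the multiline regex for the given medicine names."""
    alternation = "|".join(map(re.escape, names))
    return re.compile(
        r"^\s*(?:" + alternation + r").*"
        r"(?:\n[^\w\n]*\d*\.?\d+[^\w\n]*(?:\n.*){2})?",
        re.M,
    )


def extract_groups(text, names):
    """Return one list of lines per match, in order of appearance."""
    pattern = build_pattern(names)
    return [m.group(0).strip().split("\n") for m in pattern.finditer(text)]


groups = extract_groups(sample_text, medicines)
for group in groups:
    print(group)
```

On this sample the sketch yields six groups: four single-line ones and the two four-line blocks that start with 'Lactose monohydrate, CI 77491' and 'Povidone'. The 'Silicon dioxide colloidal anhydrous' block is skipped because that name is not in the list.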
Upvotes: 2