gim carey

Reputation: 29

Extract textual data in between two strings in a text file using Python

Let's say I have a text file with the content below:

    Quetiapine fumarate Drug substance  This document
    Povidone    Binder  USP
    This line doesn't contain any medicine name.
    This line contains Quetiapine fumarate which shouldn't be extracted as it not present at the 
    beginning of the line.
    Dibasic calcium phosphate dihydrate Diluent USP is not present in the csv
    Lactose monohydrate Diluent USNF
    Magnesium stearate  Lubricant   USNF


    Lactose monohydrate, CI 77491   
    0.6
    Colourant
    E 172

    Some lines to break the group.
    Silicon dioxide colloidal anhydrous
    (0.004
    Gliding agent
    Ph Eur

    Adding some random lines.

    Povidone
    (0.2
    Lubricant
    Ph Eur

I have a CSV file containing a list of medicine names. I want to match those names inside the .txt file and extract all the data between two medicine names, but only when the medicine name is at the beginning of a line. Examples of medicines from the CSV file are 'Quetiapine fumarate', 'Povidone', 'Magnesium stearate', and 'Lactose monohydrate'.

I want to iterate over each line of my text file and create groups from one medicine to the next.

This should only happen if the medicine name is at the start of a line, not somewhere in the middle of one.

Expected output:

['Quetiapine fumarate   Drug substance  This document'],
['Povidone  Binder  USP'],
['Lactose monohydrate   Diluent USNF'],
['Magnesium stearate    Lubricant   USNF'],
[Lactose monohydrate, CI 77491  
    0.6
    Colourant
    E 172],

[Povidone
    (0.2
    Lubricant
    Ph Eur]

Can someone please help me do this in Python?

My attempt so far:

medicines = ('Quetiapine fumarate', 'Povidone', 'Magnesium stearate', 'Lactose monohydrate')

result = []
with open('C:/Users/test1.txt', 'r', encoding='utf8') as f:
    for line in f:
        if any(line.startswith(med) for med in medicines):
            result.append(line.strip())

This captures the output below, but I need the remaining part as well:

['Quetiapine fumarate   Drug substance  This document'],
['Povidone  Binder  USP'],
['Lactose monohydrate   Diluent USNF'],
['Magnesium stearate    Lubricant   USNF']

I need to capture all the text from one medicine to the next, as shown in the expected output. If a line contains only a medicine name, I need to capture the data from the next four lines and form a group; in that case a number appears on the line right after the medicine name, as shown in the output.

Upvotes: 0

Views: 113

Answers (1)

Wiktor Stribiżew

Reputation: 626689

You may use this regex with the re.M option:

^\s*(?:Quetiapine fumarate|Povidone|Magnesium stearate|Lactose monohydrate).*(?:\n[^\w\n]*\d*\.?\d+[^\w\n]*(?:\n.*){2})?

See the regex demo

Details

  • ^ - start of a line
  • \s* - 0 or more whitespaces
  • (?:Quetiapine fumarate|Povidone|Magnesium stearate|Lactose monohydrate) - your list of medicines
  • .* - rest of the line
  • (?:\n[^\w\n]*\d*\.?\d+[^\w\n]*(?:\n.*){2})? - an optional string of
    • \n - newline
    • [^\w\n]* - 0+ chars other than word and newline chars
    • \d*\.?\d+ - a number
    • [^\w\n]* - 0+ chars other than word and newline chars
    • (?:\n.*){2} - two occurrences of a newline and the rest of the line
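To see how the optional part of the pattern behaves, here is a minimal sketch that runs it on a small in-memory string rather than a file. The `sample` text is a hypothetical stand-in, shortened from the question's data; it is not the real file.

```python
import re

pattern = re.compile(
    r"^\s*(?:Quetiapine fumarate|Povidone|Magnesium stearate|Lactose monohydrate).*"
    r"(?:\n[^\w\n]*\d*\.?\d+[^\w\n]*(?:\n.*){2})?",
    re.M,
)

# Hypothetical sample, shortened from the question's text
sample = (
    "Povidone\tBinder\tUSP\n"
    "This line contains Povidone in the middle.\n"
    "Povidone\n"
    "(0.2\n"
    "Lubricant\n"
    "Ph Eur\n"
)

matches = pattern.findall(sample)
for m in matches:
    print(repr(m))
# 'Povidone\tBinder\tUSP'
# 'Povidone\n(0.2\nLubricant\nPh Eur'
```

The first match stops at the end of its line because the next line does not consist of a number. The second line is skipped because the name is not at the start of the line. The third match picks up the number line plus the two lines after it.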

Python (see Python demo online):

import re

medicines = ['Quetiapine fumarate', 'Povidone', 'Magnesium stearate', 'Lactose monohydrate']

# Build an alternation of the escaped medicine names
med = r"(?:{})".format("|".join(map(re.escape, medicines)))
pattern = re.compile(r"^\s*" + med + r".*(?:\n[^\w\n]*\d*\.?\d+[^\w\n]*(?:\n.*){2})?", re.M)
with open('C:/Users/test1.txt', 'r', encoding='utf8') as f:
    # findall returns each matched group as a single string
    result = pattern.findall(f.read())
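Since the question mentions that the medicine names live in a CSV file, the list could also be loaded instead of hard-coded. The sketch below assumes the CSV has one name per row in its first column and no header row, which the question does not confirm. The helper names `load_medicines` and `extract_groups` are made up for illustration, and `io.StringIO` objects stand in for the real files. Note that `\s*` can also consume blank lines and indentation before a name, so each match is trimmed here.

```python
import csv
import io
import re


def load_medicines(csv_file):
    """Return the non-empty values of the first column of an open CSV file."""
    names = []
    for row in csv.reader(csv_file):
        # Skip blank rows: an empty alternative would match every line
        if row and row[0].strip():
            names.append(row[0].strip())
    return names


def extract_groups(text, medicines):
    """Find every group that starts with a medicine name at the start of a line."""
    med = "(?:{})".format("|".join(map(re.escape, medicines)))
    pattern = re.compile(
        r"^\s*" + med + r".*(?:\n[^\w\n]*\d*\.?\d+[^\w\n]*(?:\n.*){2})?", re.M
    )
    # \s* may swallow blank lines and indentation before the name, so trim each match
    return [m.strip() for m in pattern.findall(text)]


# Hypothetical stand-ins for the CSV and the text file
csv_data = io.StringIO("Povidone\nLactose monohydrate\n")
text = "Lactose monohydrate Diluent USNF\n\n\nPovidone\n(0.2\nLubricant\nPh Eur\n"

groups = extract_groups(text, load_medicines(csv_data))
print(groups)
# ['Lactose monohydrate Diluent USNF', 'Povidone\n(0.2\nLubricant\nPh Eur']
```

With real files, the CSV would typically be opened with `open(path, newline='', encoding='utf8')`, as the `csv` module documentation recommends, and the text passed in via `f.read()`.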

Upvotes: 2
