Mahshid Nik Ravesh

Reputation: 7

How to extract 3 or more words after a specific word

I've been trying to extract 3 or more words after Diagnosis: or diagnosis:, to no avail.

This is the code I've been trying:

'diagnosis: \s+((?:\w+(?:\s+|$)){2})'

It prints an empty result.

I have managed to make this code work:

"Diagnosis: (\w+)",
       "diagnosis: (\w+)",

which gives me the immediate word after Diagnosis: or diagnosis:. How can I make it work for 3 or more words?

 #@title Extract Diagnosis { form-width: "20%" }


 def extract_Diagnosis(clinical_information):
  PATTERNS = [
    "diagnosis: (\w+).",
    "Diagnosis: (\w+).",
    

     ]

 for pattern in PATTERNS:
    matches = re.findall(pattern, clinical_information)
    if len(matches) > 0:
        break

   Diagnosis = ''.join([t for t in matches if t.isalpha()])

   return Diagnosis

    for index, text in enumerate(texts):
     print(extract_Diagnosis(text))
      print("#"*79, index)

What I'm looking for is the 3 or more words that come after diagnosis: or Diagnosis: in 20 PDFs. I've already converted the PDFs to text and extracted the paragraph that contains "diagnosis:" (the clinical information).

Upvotes: 0

Views: 105

Answers (1)

CryptoFool

Reputation: 23089

OK, here is a new answer that focuses more on the problems with your code than on the problems with your regular expression. First of all, your regular expression needs a small tweak: remove the literal space after the colon (together with \s+ it demands at least two whitespace characters) and change 2 to 3:

diagnosis:\s+((?:\w+(?:\s+|$)){3})
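To see the difference between the two patterns, here is a small check. The sample sentence is made up for illustration and is not taken from the real PDFs.

```python
import re

# Hypothetical sample text (an assumption, not real clinical data).
sample = "diagnosis: acute viral bronchitis with cough"

# Pattern from the question: a literal space followed by \s+ requires
# at least two whitespace characters after the colon.
original = r"diagnosis: \s+((?:\w+(?:\s+|$)){2})"

# Tweaked pattern: \s+ alone, and exactly three words captured.
tweaked = r"diagnosis:\s+((?:\w+(?:\s+|$)){3})"

original_matches = re.findall(original, sample)
tweaked_matches = re.findall(tweaked, sample)

print(original_matches)
print(tweaked_matches)
```

With a single space after the colon, the original pattern finds nothing, while the tweaked one captures the three words along with the whitespace that follows them.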

Your code has a number of issues. Here's a version of your code that kinda works, although it may not be doing exactly what you want:

import re

def extract_Diagnosis(clinical_information):
    PATTERNS = [r"diagnosis:\s+((?:\w+(?:\s+|$)){3})"]
    matches = []
    for pattern in PATTERNS:
        matches = re.findall(pattern, clinical_information)
        if len(matches) > 0:
            break
    Diagnosis = ''.join([t for t in matches])
    return Diagnosis


texts = ["diagnosis: a b c    blah blah blah      diagnosis:   asdf asdf asdf  x x x "]

for index, text in enumerate(texts):
    print(extract_Diagnosis(text))
    print("#"*79, index)

Result:

a b c    asdf asdf asdf

Here are the things I fixed with your code:

  1. I replaced the two regular expressions with the one expression in your question, with the modifications mentioned above.
  2. I added an r to the front of the string constant containing the regular expression. This specifies a "raw string" in Python. You need to either do this or double up your backslashes.
  3. You were filtering your results with the expression if t.isalpha(). Given your expression, this will always be False because what you are matching will always contain spaces as well as word characters. I see no reason for this test anyway, since you know exactly what you're getting because what you get matched your regular expression.
  4. I fixed indentation so that everything worked. It may be that you had that right in your original code and it just got messed up moving it into your question.
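The question also asks for both Diagnosis: and diagnosis:, and for "3 or more" words rather than exactly 3. One possible variation is sketched below. It assumes that "3 or more" means at least 3 consecutive word tokens, and the function name extract_diagnoses and the min_words parameter are made up for this example.

```python
import re

def extract_diagnoses(clinical_information, min_words=3):
    """Return each run of at least min_words words found after 'diagnosis:'.

    Hypothetical helper: the name and the min_words parameter are
    assumptions for illustration only.
    """
    pattern = re.compile(
        # {N,} means "N or more" repetitions of word-plus-whitespace.
        r"diagnosis:\s+((?:\w+(?:\s+|$)){%d,})" % min_words,
        re.IGNORECASE,  # matches both "Diagnosis:" and "diagnosis:"
    )
    # Strip the trailing whitespace that the pattern captures.
    return [m.strip() for m in pattern.findall(clinical_information)]

found = extract_diagnoses(
    "Diagnosis: chronic kidney disease stage three\nPlan: follow up"
)
print(found)
```

Because \s+ also matches newlines, the capture continues onto following lines until it reaches a word followed by a non-word character such as a colon or a period. If your diagnoses always sit on one line, you may want to split the text into lines first.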

I hope this helps!

Upvotes: 2
