Reputation: 569

Pulling out two separate words from a string using regular expressions?

I need to improve on a regular expression I'm using. Currently, here it is:

^[a-zA-Z\s/-]+

I'm using it to pull out medication names from a variety of formulation strings, for example:

SULFAMETHOXAZOLE-TRIMETHOPRIM 200-40 MG/5ML PO SUSP
AMOX TR/POTASSIUM CLAVULANATE 125 mg-31.25 mg ORAL TABLET, CHEWABLE
AMOXICILLIN TRIHYDRATE 125 mg ORAL TABLET, CHEWABLE
AMOX TR/POTASSIUM CLAVULANATE 125 mg-31.25 mg ORAL TABLET, CHEWABLE
Amoxicillin 1000 MG / Clavulanate 62.5 MG Extended Release Tablet

The resulting matches on these examples are:

SULFAMETHOXAZOLE-TRIMETHOPRIM
AMOX TR/POTASSIUM CLAVULANATE
AMOXICILLIN TRIHYDRATE
AMOX TR/POTASSIUM CLAVULANATE
Amoxicillin

The first four are what I want, but on the fifth, I really need "Amoxicillin / Clavulanate".

How would I pull out patterns like "Amoxicillin / Clavulanate" (in the fifth row) while skipping patterns like "MG/5ML" (in the first row)?

Update

Thanks for the help, everyone. Here's a longer list of examples with more nuances of the data:

Amoxicillin 1000 MG / Clavulanate 62.5 MG Extended Release Tablet
Amoxicillin 1000 MG / Clavulanate 62.5 MG Extended Release Tablet
Amoxicillin 10 MG/ML Oral Suspension
Amoxil 10 MG/ML Oral Suspension
AMOXICILLIN TRIHYDRATE 125 mg ORAL TABLET, CHEWABLE
AMOXAPINE
AMOX TR/POTASSIUM CLAVULANATE 125 mg-31.25 mg ORAL TABLET, CHEWABLE
AMOXICILLIN TRIHYDRATE 125 mg ORAL TABLET, CHEWABLE
AMOX TR/POTASSIUM CLAVULANATE 125 mg-31.25 mg ORAL TABLET, CHEWABLE
AMOX TR/POTASSIUM CLAVULANATE 125 mg-31.25 mg ORAL TABLET, CHEWABLE
CARBATROL 200 MG PO CP12
CARBATROL 200 MG PO CP12
CARBATROL
CARBAMAZEPINE 100 MG PO CHEW
CEFDINIR 250 MG/5ML PO SUSR
AMOXICILLIN 400 MG/5ML PO SUSR
SULFAMETHOXAZOLE-TRIMETHOPRIM 200-40 MG/5ML PO SUSP
DIAZEPAM 2 MG PO TABS
DIAZEPAM
PREDNISONE 20 MG PO TABS
AUGMENTIN 250-62.5 MG/5ML PO SUSR
ACETAMINOPHEN 325 MG/10.15ML PO SUSP

What I've done for now is this:

    private static string GetMedNameFromIncomingConceptString(string conceptAsString)
    {
        // look for match at beginning of string
        Match firstRegMatch = new Regex(@"^[a-zA-Z\s/-]+").Match(conceptAsString);
        if (firstRegMatch.Success)
        {
            // grab matching part of string as whole string
            string firstPart = conceptAsString.Substring(firstRegMatch.Index, firstRegMatch.Length);

            // look for an additional match following a slash (like Amox 1000 / Clav 50)
            Match secondRegMatch = new Regex(@"/\s[a-zA-Z\s/-]+").Match(conceptAsString, firstRegMatch.Length);
            if (secondRegMatch.Success) 
                return firstPart + conceptAsString.Substring(secondRegMatch.Index, secondRegMatch.Length);
            else
                return firstPart;
        }
        else
        {
            return conceptAsString;
        }
    }

It's pretty ugly, and I imagine it may fail when I run a lot more data through it, but it works for the larger set of cases I listed above.

Upvotes: 2

Answers (5)

Zano

Reputation: 2761

Looking at the new data, the easiest, and arguably the cleanest and most robust, way to do what you want is to first remove the usage (tablet, chewable, susp) and then remove the dosages.

private static string GetMedNameFromIncomingConceptString(string conceptAsString) {   
   Regex compoundsAndDosages = new Regex(@".*[\s\d]m[gl]", RegexOptions.IgnoreCase);
   Regex onlyDosage = new Regex(@"\s?[\d.-]+\s?m[gl][\/-]?", RegexOptions.IgnoreCase);

   // keep compounds and dosage (= remove usage)
   Match cad = compoundsAndDosages.Match(conceptAsString); 
   if (cad.Success) {
      // remove dosages (= keep compounds)
      return onlyDosage.Replace(cad.Value, ""); 
   } else {
      return conceptAsString;
   }
}

Upvotes: 0

Zano

Reputation: 2761

The problem with your regex is that it stops matching as soon as it encounters a digit. The assumption is that once you have a dosage, you're done. However, the fifth example counters that assumption.
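To see that behaviour in isolation, here is a minimal Java sketch that runs the question's original pattern on two of the sample strings. The class and method names (`PrefixMatchDemo`, `leadingName`) are made up for illustration; only the pattern comes from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PrefixMatchDemo {
    // The question's original pattern: letters, whitespace, '/' and '-' from the start of the string.
    private static final Pattern NAME_PREFIX = Pattern.compile("^[a-zA-Z\\s/-]+");

    // Returns the leading run the pattern matches (trimmed), or "" if nothing matches.
    static String leadingName(String s) {
        Matcher m = NAME_PREFIX.matcher(s);
        return m.find() ? m.group().trim() : "";
    }

    public static void main(String[] args) {
        // The match ends at the first digit, so the second compound is never reached.
        System.out.println(leadingName("Amoxicillin 1000 MG / Clavulanate 62.5 MG Extended Release Tablet"));
        // prints "Amoxicillin"
        System.out.println(leadingName("AMOX TR/POTASSIUM CLAVULANATE 125 mg-31.25 mg ORAL TABLET, CHEWABLE"));
        // prints "AMOX TR/POTASSIUM CLAVULANATE"
    }
}
```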

If you think about using regexes, consider this: how would you go about explaining the rule for extracting medications to a regular Joe? Something like "Any and all strings containing letters or the characters / and -, except for the words mg, ml, oral, extended, release, tablet, chewable, po, susp." That sounds pretty difficult, and it probably doesn't even cover all cases.

If the examples are representative for your data, I do see a pattern. Assuming Perl:

/($compound+ $dosage)+ $usage/xi

where

$compound = qr/[a-z-] [\s\/]?/x;
$dosage = qr/(\/? [\d.-] \s (ml|mg))+/x; # add measurement units if needed
$usage = qr/.*/; # rest of string

Pretty hairy if you ask me, and I haven't tested it, ~~only proven it correct~~. It would probably need some tweaking.

Edit: I see that you've added the tag .net, but the regexes would look similar in C#.

Upvotes: 0

Alan Moore

Reputation: 75242

When a slash is part of the dosage, is it always followed immediately by a digit? If so, this regex should do for you:

([A-Z]\D+)\d[^/]*(?:/\d[^/]*)*

It actively matches the dosage information as the others suggested, but captures only the medication name. You then replace every match with $1 (the captured name), which deletes the dosage. Here's how I tested it in Java:

String[] data = { 
  "SULFAMETHOXAZOLE-TRIMETHOPRIM 200-40 MG/5ML PO SUSP",
  "AMOX TR/POTASSIUM CLAVULANATE 125 mg-31.25 mg ORAL TABLET, CHEWABLE",
  "AMOXICILLIN TRIHYDRATE 125 mg ORAL TABLET, CHEWABLE",
  "AMOX TR/POTASSIUM CLAVULANATE 125 mg-31.25 mg ORAL TABLET, CHEWABLE",
  "Amoxicillin 1000 MG / Clavulanate 62.5 MG Extended Release Tablet"
};
Pattern p = Pattern.compile("([A-Z]\\D+)\\d[^/]*(?:/\\d[^/]*)*");
Matcher m = p.matcher("");
for (String s : data)
{
  System.out.println(m.reset(s).replaceAll("$1"));
}

output:

SULFAMETHOXAZOLE-TRIMETHOPRIM
AMOX TR/POTASSIUM CLAVULANATE
AMOXICILLIN TRIHYDRATE
AMOX TR/POTASSIUM CLAVULANATE
Amoxicillin / Clavulanate

EDIT: Okay, it looks like the slash in the dosage is always followed by ML, which may be preceded by a number, which may include a decimal point. Also, the dosage information may be missing entirely. This regex seems to yield the desired result for your expanded sample input:

([A-Z]\D+)(?:$|\d[^/]*(?:/[\d.]*ML[^/]*)*)

It should work in C#, too.
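As a rough check, the revised regex can be run over some of the expanded samples in the same way as the Java test above. This is only a sketch: the class and method names (`MedNameExtractor`, `extractName`) are invented for illustration, and the `trim()` call is an addition to drop the space left behind between the name and the removed dosage.

```java
import java.util.regex.Pattern;

class MedNameExtractor {
    // Revised regex from the edit above: capture the name, then match either
    // end of input or a dosage whose slashes are followed by an optional number and "ML".
    private static final Pattern NAME_THEN_DOSAGE =
        Pattern.compile("([A-Z]\\D+)(?:$|\\d[^/]*(?:/[\\d.]*ML[^/]*)*)");

    // Replaces each "name + dosage" match with just the captured name.
    static String extractName(String concept) {
        return NAME_THEN_DOSAGE.matcher(concept).replaceAll("$1").trim();
    }

    public static void main(String[] args) {
        String[] data = {
            "Amoxicillin 1000 MG / Clavulanate 62.5 MG Extended Release Tablet",
            "Amoxicillin 10 MG/ML Oral Suspension",
            "AMOXAPINE",
            "CARBATROL 200 MG PO CP12",
            "ACETAMINOPHEN 325 MG/10.15ML PO SUSP"
        };
        for (String s : data) {
            System.out.println(extractName(s));
        }
    }
}
```

Note that the regex relies on names starting with an uppercase letter and on dosage slashes being followed by ML, as described above; data that breaks either assumption would need a different pattern.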

Upvotes: 1

David Kanarek

Reputation: 12613

I think you would be better off removing words you know won't be part of the medication name, such as "oral", numbers, etc. This should leave you with what you want.

Alternatively, if you have a database of medications, you can extract only words from that database, which should leave you with just the medications.

I realize these solutions don't use regular expressions, but I don't think they're up to the task you've set for them.
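A minimal Java sketch of the word-removal idea, assuming tokens are separated by whitespace. The stop list is hypothetical and built only from the sample strings in the question, and the names `StopWordFilter` and `extractName` are invented for illustration; real data would almost certainly need more entries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StopWordFilter {
    // Hypothetical stop list derived from the question's samples only.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
        "MG", "ML", "MG/ML", "ORAL", "PO", "TABLET", "TABS", "CHEWABLE", "CHEW",
        "SUSP", "SUSR", "SUSPENSION", "EXTENDED", "RELEASE"));

    // Keeps every token that contains no digit and is not a known non-name word.
    static String extractName(String concept) {
        List<String> kept = new ArrayList<>();
        for (String token : concept.trim().split("\\s+")) {
            // Compare case-insensitively and without trailing commas ("TABLET," -> "TABLET").
            String key = token.replaceAll(",+$", "").toUpperCase(Locale.ROOT);
            boolean hasDigit = key.chars().anyMatch(Character::isDigit);
            if (!hasDigit && !STOP_WORDS.contains(key)) {
                kept.add(token);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        System.out.println(extractName("Amoxicillin 1000 MG / Clavulanate 62.5 MG Extended Release Tablet"));
        System.out.println(extractName("AMOX TR/POTASSIUM CLAVULANATE 125 mg-31.25 mg ORAL TABLET, CHEWABLE"));
    }
}
```

One trade-off of this sketch: any token containing a digit is dropped, so a medication name that itself includes a number would be lost. Looking names up in a medication database, as suggested above, avoids that problem.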

Upvotes: 0

Ignacio Vazquez-Abrams

Reputation: 799110

What you're asking for can't be done, since any attempt to do so would result in also picking up "PO SUSP", "ORAL TABLET", etc. What I recommend you do is try to pick up both the compound and the dosage, then strip off the dosage after.

Upvotes: 0
