Reputation: 569
I need to improve on a regular expression I'm using. Currently, here it is:
^[a-zA-Z\s/-]+
I'm using it to pull out medication names from a variety of formulation strings, for example:
The resulting matches on these examples are:
The first four are what I want, but on the fifth, I really need "Amoxicillin / Clavulanate".
How would I pull out patterns like "Amoxicillin / Clavulanate" (in the fifth row) while skipping patterns like "MG/5 ML" (in the first row)?
Update
Thanks for the help, everyone. Here's a longer list of examples with more nuances of the data:
What I've done for now is this:
private static string GetMedNameFromIncomingConceptString(string conceptAsString)
{
// look for match at beginning of string
Match firstRegMatch = new Regex(@"^[a-zA-Z\s/-]+").Match(conceptAsString);
if (firstRegMatch.Success)
{
// grab matching part of string as whole string
string firstPart = conceptAsString.Substring(firstRegMatch.Index, firstRegMatch.Length);
// look for additional match following a slash (like Amox 1000 / Clav 50)
Match secondRegMatch = new Regex(@"/\s[a-zA-Z\s/-]+").Match(conceptAsString, firstRegMatch.Length);
if (secondRegMatch.Success)
return firstPart + conceptAsString.Substring(secondRegMatch.Index, secondRegMatch.Length);
else
return firstPart;
}
else
{
return conceptAsString;
}
}
It's pretty ugly, and I imagine it may fail when I run a lot more data through it, but it works for the larger set of cases I listed above.
Upvotes: 2
Views: 287
Reputation: 2761
Looking at the new data, the easiest, and arguably the cleanest and most robust, way to do what you want is to first remove the usage (tablet, chewable, susp) and then remove the dosages.
private static string GetMedNameFromIncomingConceptString(string conceptAsString) {
Regex compoundsAndDosages = new Regex(@".*[\s\d]m[gl]", RegexOptions.IgnoreCase);
Regex onlyDosage = new Regex(@"\s?[\d.-]+\s?m[gl][\/-]?", RegexOptions.IgnoreCase);
// keep compounds and dosage (= remove usage)
Match cad = compoundsAndDosages.Match(conceptAsString);
if (cad.Success) {
// remove dosages (= keep compounds)
return onlyDosage.Replace(cad.Value, "");
} else {
return conceptAsString;
}
}
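For readers who want to try the two-step idea outside .NET, here is a rough Java port of the method above. It is only a sketch: the class name TwoStepMedName is made up for illustration, and the two patterns are copied from the C# version with nothing changed except the escaping.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoStepMedName {
    // Same two patterns as the C# code above, compiled case-insensitively.
    private static final Pattern COMPOUNDS_AND_DOSAGES =
            Pattern.compile(".*[\\s\\d]m[gl]", Pattern.CASE_INSENSITIVE);
    private static final Pattern ONLY_DOSAGE =
            Pattern.compile("\\s?[\\d.-]+\\s?m[gl][/-]?", Pattern.CASE_INSENSITIVE);

    static String medName(String concept) {
        Matcher cad = COMPOUNDS_AND_DOSAGES.matcher(concept);
        if (!cad.find()) {
            return concept; // no mg/ml unit found, leave the string untouched
        }
        // Step 1 kept everything up to the last mg/ml (usage is gone);
        // step 2 deletes every dosage inside that prefix.
        return ONLY_DOSAGE.matcher(cad.group()).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(medName(
                "Amoxicillin 1000 MG / Clavulanate 62.5 MG Extended Release Tablet"));
        System.out.println(medName(
                "SULFAMETHOXAZOLE-TRIMETHOPRIM 200-40 MG/5ML PO SUSP"));
    }
}
```

Note that a string with no mg/ml unit at all is returned unchanged, usage words included, so this approach relies on every formulation carrying a dosage.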
Upvotes: 0
Reputation: 2761
The problem with your regex is that it stops matching as soon as it encounters a digit. The assumption is that once you have a dosage, you're done. However, the fifth example counters that assumption.
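To see that stopping point concretely, here is a small Java check of the original pattern (Java only because another answer in this thread tests with it; the class and method names are made up for illustration).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstDigitStop {
    // The pattern from the question: letters, whitespace, '/' and '-' only.
    private static final Pattern LEADING_NAME = Pattern.compile("^[a-zA-Z\\s/-]+");

    static String leadingName(String concept) {
        Matcher m = LEADING_NAME.matcher(concept);
        // The character class contains no digits, so the match ends
        // just before the first digit in the string.
        return m.find() ? m.group() : concept;
    }

    public static void main(String[] args) {
        // Prints [Amoxicillin ] : the match stops at the "1" of "1000".
        System.out.println("[" + leadingName(
                "Amoxicillin 1000 MG / Clavulanate 62.5 MG Extended Release Tablet") + "]");
    }
}
```

Everything after the first dosage, including "/ Clavulanate", is never looked at, which is why the fifth example comes back truncated.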
If you're thinking about using regexes, consider this: how would you go about explaining the rule for extracting medications to a regular Joe? Something like "Any and all strings containing letters or the characters / and -, except for the words mg, ml, oral, extended, release, tablet, chewable, po, susp." Sounds pretty difficult, considering it probably doesn't cover all cases.
If the examples are representative for your data, I do see a pattern. Assuming Perl:
/($compound+ $dosage)+ $usage/xi
where
$compound = qr/[a-z-] [\s\/]?/xi;
$dosage = qr/(\/? [\d.-]+ \s? (ml|mg))+/xi; # add measurement units if needed
$usage = qr/.*/; # rest of string
Pretty hairy if you ask me, and I haven't tested it, only proven it correct. It would probably need some tweaking.
Edit: I see that you've added the .net tag, but the regexes would look similar in C#.
Upvotes: 0
Reputation: 75242
When a slash is part of the dosage, is it always followed immediately by a digit? If so, this regex should do for you:
([A-Z]\D+)\d[^/]*(?:/\d[^/]*)*
It actively matches the dosage information as the others suggested, but captures only the medication name. Then you do a global replace with $1 to delete the dosage. Here's how I tested it in Java:
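import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MedNameTest
{
  public static void main(String[] args)
  {
    String[] data = {
      "SULFAMETHOXAZOLE-TRIMETHOPRIM 200-40 MG/5ML PO SUSP",
      "AMOX TR/POTASSIUM CLAVULANATE 125 mg-31.25 mg ORAL TABLET, CHEWABLE",
      "AMOXICILLIN TRIHYDRATE 125 mg ORAL TABLET, CHEWABLE",
      "AMOX TR/POTASSIUM CLAVULANATE 125 mg-31.25 mg ORAL TABLET, CHEWABLE",
      "Amoxicillin 1000 MG / Clavulanate 62.5 MG Extended Release Tablet"
    };
    // Group 1 captures the name; the rest of the pattern consumes the dosage.
    Pattern p = Pattern.compile("([A-Z]\\D+)\\d[^/]*(?:/\\d[^/]*)*");
    Matcher m = p.matcher("");
    for (String s : data)
    {
      // Replacing each match with $1 keeps the name and drops the dosage.
      System.out.println(m.reset(s).replaceAll("$1"));
    }
  }
}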
output:
SULFAMETHOXAZOLE-TRIMETHOPRIM
AMOX TR/POTASSIUM CLAVULANATE
AMOXICILLIN TRIHYDRATE
AMOX TR/POTASSIUM CLAVULANATE
Amoxicillin / Clavulanate
EDIT: Okay, it looks like the slash in the dosage is always followed by ML, which may be preceded by a number, which may include a decimal point. Also, the dosage information may be missing entirely. This regex seems to yield the desired result for your expanded sample input:
([A-Z]\D+)(?:$|\d[^/]*(?:/[\d.]*ML[^/]*)*)
It should work in C#, too.
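As a quick sanity check of the revised pattern, here is a sketch in the same style as the Java test above. The class name DosageStripCheck and the trim() call are additions for illustration (the replacement leaves a trailing space after each name), and the dosage-free input is a made-up example rather than one from the question.

```java
import java.util.regex.Pattern;

class DosageStripCheck {
    // Revised pattern: either the name runs to the end of the string,
    // or a dosage follows, optionally continued by "/<number>ML" parts.
    private static final Pattern P = Pattern.compile(
            "([A-Z]\\D+)(?:$|\\d[^/]*(?:/[\\d.]*ML[^/]*)*)");

    static String strip(String concept) {
        // Keep only group 1 (the name) of every match, then trim.
        return P.matcher(concept).replaceAll("$1").trim();
    }

    public static void main(String[] args) {
        System.out.println(strip("SULFAMETHOXAZOLE-TRIMETHOPRIM 200-40 MG/5ML PO SUSP"));
        System.out.println(strip(
                "Amoxicillin 1000 MG / Clavulanate 62.5 MG Extended Release Tablet"));
        System.out.println(strip("AMOXICILLIN")); // hypothetical input with no dosage
    }
}
```

Because the pattern is case-sensitive, it only treats an upper-case "ML" after the slash as part of the dosage; add a case-insensitive flag if your data mixes cases there.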
Upvotes: 1
Reputation: 12613
I think you would be better off removing the words you know won't be part of the medication name, such as oral, numbers, etc. This should leave you with what you want.
Alternatively, if you have a database of medications, you can extract only words from that database, which should leave you with just the medications.
I realize these solutions don't use regular expressions, but I don't think they're up to the task you've set for them.
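A minimal sketch of the word-removal idea, in Java, might look like the following. The stop list is hypothetical: it only contains the words that appear in the sample strings in this thread and would need extending for real data, and the class name is made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StopWordMedName {
    // Hypothetical stop list, built from the words seen in the samples.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "mg", "ml", "oral", "extended", "release", "tablet", "chewable", "po", "susp"));

    static String medName(String concept) {
        List<String> kept = new ArrayList<>();
        for (String token : concept.trim().split("\\s+")) {
            // Drop anything containing a digit: "1000", "200-40", "MG/5ML".
            if (token.matches(".*\\d.*")) continue;
            // Compare letters only, so "TABLET," still matches "tablet".
            String key = token.replaceAll("[^A-Za-z]", "").toLowerCase(Locale.ROOT);
            if (STOP_WORDS.contains(key)) continue;
            kept.add(token);
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        System.out.println(medName(
                "Amoxicillin 1000 MG / Clavulanate 62.5 MG Extended Release Tablet"));
    }
}
```

A lone "/" survives because it has no letters and no digits, which is what keeps "Amoxicillin / Clavulanate" intact. The obvious weakness is that any usage word missing from the list ends up in the name.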
Upvotes: 0
Reputation: 799110
What you're asking for can't be done, since any attempt to do so would result in also picking up "PO SUSP", "ORAL TABLET", etc. What I recommend you do is try to pick up both the compound and the dosage, then strip off the dosage after.
Upvotes: 0