Pedro Domingues

Reputation: 145

Python regex: match groups only when they are separated by a special character

I'm working with a dataframe of medicines and I want to extract the dosages from a full sentence taken from the product description.

Example of what I want:

Dexamethasonacetat 5 mg/10 mg, Lidocain-HCl 1H2O 30 mg/60 mg
#['5mg/10mg','30mg/60mg']

Anakinra 120 mg /-20 g /-12mg gentechnologisch hergestellt aus E. coli. 10mg pack
#['120mg/20g/12mg']

I can extract the dosage using \d+(?:[.,]\d+)*\s*(g|mg|), which gets me:

Dexamethasonacetat 5 mg/10 mg, Lidocain-HCl 1H2O 30 mg/60 mg
#['5mg','10mg','30mg','60mg']

Anakinra 120 mg /-20 g /-12mg gentechnologisch hergestellt aus E. coli. 10mg pack
#['120mg','20g','12mg','10mg']

This would be easier if the / occurred only once, but it can occur multiple times.

Upvotes: 0

Views: 42

Answers (1)

The fourth bird

Reputation: 163362

You could get those matches using a pattern, and then post-process each match to remove the spaces and the hyphens

-?\b\d+(?:[.,]\d+)*\s*m?g(?:\s*/\s*-?\d+(?:[.,]\d+)*\s*m?g)+\b

Explanation

  • -? Match an optional hyphen
  • \b A word boundary to prevent a partial word match
  • \d+(?:[.,]\d+)* Match 1+ digits with optional decimal part
  • \s*m?g Match optional whitespace chars, an optional m, then g
  • (?: Non capture group to repeat as a whole
    • \s*/\s* Match / between optional whitespace chars
    • -?\d+(?:[.,]\d+)*\s*m?g Match the same digits pattern as before
  • )+ Close the non capture group and repeat 1+ times to match at least a part with a forward slash
  • \b A word boundary
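
The breakdown above can also be written directly into the pattern with re.VERBOSE, which ignores whitespace in the pattern and allows inline comments. This is only a sketch of the same regex; the name DOSAGE_PATTERN and the sample string are made up for illustration.

```python
import re

# DOSAGE_PATTERN is a hypothetical name; the regex is the one explained above.
DOSAGE_PATTERN = re.compile(
    r"""
    -?                      # optional leading hyphen
    \b                      # word boundary: the number must not start inside a word
    \d+(?:[.,]\d+)*         # 1+ digits with an optional decimal part
    \s*m?g                  # optional whitespace, then mg or g
    (?:                     # one follow-up dose after a slash
        \s*/\s*             #   slash between optional whitespace
        -?\d+(?:[.,]\d+)*   #   optional hyphen, then another number
        \s*m?g              #   and its unit
    )+                      # require at least one follow-up dose
    \b                      # word boundary after the last unit
    """,
    re.VERBOSE,
)

# Illustrative input: only the slash-separated dosage should match.
print(DOSAGE_PATTERN.findall("Lidocain-HCl 1H2O 30 mg/60 mg, 10mg pack"))
```

Because the pattern has no capture groups, findall returns the full matches, still containing the original spaces and hyphens.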

See a regex demo and a Python demo.

Example

import re

pattern = r"-?\b\d+(?:[.,]\d+)*\s*m?g(?:\s*/\s*-?\d+(?:[.,]\d+)*\s*m?g)+\b"

strings = [
    "Dexamethasonacetat 5 mg/10 mg, Lidocain-HCl 1H2O 30 mg/60 mg",
    "Anakinra 120 mg /-20 g /-12mg gentechnologisch hergestellt aus E. coli. 10mg pack"
]

for s in strings:
    print([re.sub(r"[\s-]+", "", m) for m in re.findall(pattern, s)])

Output

['5mg/10mg', '30mg/60mg']
['120mg/20g/12mg']
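
Since the question mentions a dataframe, the match and clean-up steps could be wrapped in one function so they can be applied per row. A minimal sketch, assuming the descriptions are plain strings; extract_dosages is a hypothetical helper name, not something from the original post.

```python
import re

# Same pattern as in the example above, compiled once for reuse.
PATTERN = re.compile(
    r"-?\b\d+(?:[.,]\d+)*\s*m?g(?:\s*/\s*-?\d+(?:[.,]\d+)*\s*m?g)+\b"
)


def extract_dosages(text):
    """Return the slash-separated dosages in text, with spaces and hyphens removed."""
    return [re.sub(r"[\s-]+", "", match) for match in PATTERN.findall(text)]


print(extract_dosages(
    "Anakinra 120 mg /-20 g /-12mg gentechnologisch hergestellt aus E. coli. 10mg pack"
))
```

With pandas, this could then be used as something like df["description"].apply(extract_dosages), where "description" is an assumed column name.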

Upvotes: 2
