Reputation: 145
I'm working with a dataframe of medicines, and I want to extract the dosages from a full sentence taken from the product description.
Example of what I want:
Dexamethasonacetat 5 mg/10 mg, Lidocain-HCl 1H2O 30 mg/60 mg
#['5mg/10mg','30mg/60mg']
Anakinra 120 mg /-20 g /-12mg gentechnologisch hergestellt aus E. coli. 10mg pack
#['120mg/20g/12mg']
I can extract the individual dosages using \d+(?:[.,]\d+)*\s*(g|mg|), which gets me:
Dexamethasonacetat 5 mg/10 mg, Lidocain-HCl 1H2O 30 mg/60 mg
#['5mg','10mg','30mg','60mg']
Anakinra 120 mg /-20 g /-12mg gentechnologisch hergestellt aus E. coli. 10mg pack
#['120mg','20g','12mg','10mg']
This would be easier if / appeared only once, but it can occur multiple times.
Upvotes: 0
Views: 42
Reputation: 163362
You could get those matches with a single pattern, and then post-process each match to remove the spaces and the hyphens:
-?\b\d+(?:[.,]\d+)*\s*m?g(?:\s*/\s*-?\d+(?:[.,]\d+)*\s*m?g)+\b
Explanation
-?
Match an optional hyphen
\b
A word boundary to prevent a partial word match
\d+(?:[.,]\d+)*
Match 1+ digits with an optional decimal part
\s*m?g
Match optional whitespace chars, an optional m, and g
(?:
Non-capture group to repeat as a whole
\s*/\s*
Match / between optional whitespace chars
-?\d+(?:[.,]\d+)*\s*m?g
Match an optional hyphen and the same digits pattern as before
)+
Close the non-capture group and repeat it 1+ times to match at least one part with a forward slash
\b
A word boundary
See a regex demo and a Python demo.
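The same breakdown can be kept next to the pattern itself by compiling it with re.VERBOSE, which ignores whitespace in the pattern and allows # comments. This is only a sketch of an alternative way to write the pattern; the name DOSAGE is made up for illustration.

```python
import re

# DOSAGE is a hypothetical name; the pattern is the one explained above,
# spread over several lines with re.VERBOSE so each part can be commented.
DOSAGE = re.compile(
    r"""
    -?                      # optional leading hyphen
    \b                      # word boundary, prevents a partial word match
    \d+(?:[.,]\d+)*         # 1+ digits with an optional decimal part
    \s*m?g                  # optional whitespace, then mg or g
    (?:                     # group repeated as a whole
        \s*/\s*             # slash between optional whitespace
        -?\d+(?:[.,]\d+)*   # optional hyphen and the next number
        \s*m?g              # unit of that number
    )+                      # require at least one slash-separated part
    \b                      # trailing word boundary
    """,
    re.VERBOSE,
)

match = DOSAGE.search("Lidocain-HCl 1H2O 30 mg/60 mg")
print(match.group())  # 30 mg/60 mg
```

The digit in 1H2O is skipped because it is not followed by a unit, so the first match starts at 30.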
Example
import re
pattern = r"-?\b\d+(?:[.,]\d+)*\s*m?g(?:\s*/\s*-?\d+(?:[.,]\d+)*\s*m?g)+\b"
strings = [
"Dexamethasonacetat 5 mg/10 mg, Lidocain-HCl 1H2O 30 mg/60 mg",
"Anakinra 120 mg /-20 g /-12mg gentechnologisch hergestellt aus E. coli. 10mg pack"
]
for s in strings:
    # Strip whitespace and hyphens from every match
    print([re.sub(r"[\s-]+", "", m) for m in re.findall(pattern, s)])
Output
['5mg/10mg', '30mg/60mg']
['120mg/20g/12mg']
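Since the question mentions a dataframe, it may be convenient to wrap the two steps in one function. The sketch below uses only the standard library; the names DOSAGE_RE and extract_dosages are hypothetical and not part of any library.

```python
import re

# Same pattern as above, compiled once so it can be reused for many rows.
DOSAGE_RE = re.compile(
    r"-?\b\d+(?:[.,]\d+)*\s*m?g(?:\s*/\s*-?\d+(?:[.,]\d+)*\s*m?g)+\b"
)


def extract_dosages(text):
    """Return the slash-separated dosages in text, without spaces or hyphens."""
    # The pattern has no capture groups, so findall returns the full matches.
    return [re.sub(r"[\s-]+", "", m) for m in DOSAGE_RE.findall(text)]


print(extract_dosages("Dexamethasonacetat 5 mg/10 mg, Lidocain-HCl 1H2O 30 mg/60 mg"))
```

With pandas, a function like this could be passed to Series.apply on the description column, for example df["description"].apply(extract_dosages), where the column name is an assumption. A description with no slash-separated dosage, such as "10mg pack", gives an empty list.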
Upvotes: 2