nltk tokenize measurement units

Question

I'm trying to extract measurements from a messy dataset. Some basic example entries would be:

1.5 gram of paracetamol
1.5g of paracetamol
1.5 grams. of paracetamol

I'm trying to extract the measurement and units for each entry so the result for all the above should be: (1.5, g)

Some other questions proposed the use of NLTK for such a task but I'm running into trouble when doing the following:

import nltk

s1 = "1.5g of paracetamol"
s2 = "1.5  gram of paracetamol"

words_s1 = nltk.word_tokenize(s1)
words_s2 = nltk.word_tokenize(s2)

nltk.pos_tag(words_s1)
nltk.pos_tag(words_s2)

Which returns

[('1.5g', 'CD'), ('of', 'IN'), ('paracetamol', 'NN')]
[('1.5', 'CD'), ('gram', 'NN'), ('of', 'IN'), ('paracetamol', 'NN')]

The problem is that the unit 'g' is being kept as part of the CD in the first example. How could I get the following result?

[('1.5', 'CD'), ('g', 'NN'), ('of', 'IN'), ('paracetamol', 'NN')]

On the real data set the units are much more varied (mg, miligrams, kg, kgrams. ...)

Thanks!

Casimir et Hippolyte · Accepted Answer

The default word tokenizer keeps '1.5g' as a single token, so you need to tokenize the sentence yourself, for example with nltk.regexp_tokenize:

words_s1 = nltk.regexp_tokenize(s1, r'(?u)\d+(?:\.\d+)?|\w+')

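The pattern tries a number (with an optional decimal part) first and otherwise falls back to a run of word characters, so '1.5g' is split into '1.5' and 'g'. It will need refining to deal with more complicated cases.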
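
As a rough sketch of how this could be taken further toward the (value, unit) result asked for in the question, the snippet below applies the same pattern with Python's re module. For a pattern without capturing groups, re.findall should yield the same tokens as nltk.regexp_tokenize. The names UNIT_ALIASES, tokenize and extract_measurement are made up for this illustration, and the alias table is only a guess at the spellings in the real data.

```python
import re

# Same pattern as above: a number with an optional decimal part,
# otherwise a run of word characters.
TOKEN_RE = re.compile(r'\d+(?:\.\d+)?|\w+')
NUMBER_RE = re.compile(r'\d+(?:\.\d+)?')

# Hypothetical lookup table mapping spellings to a canonical unit.
# Extend it with whatever variants actually occur in the data set.
UNIT_ALIASES = {
    'g': 'g', 'gram': 'g', 'grams': 'g',
    'mg': 'mg', 'miligrams': 'mg', 'milligram': 'mg', 'milligrams': 'mg',
    'kg': 'kg', 'kgrams': 'kg', 'kilogram': 'kg', 'kilograms': 'kg',
}


def tokenize(text):
    """Split text into number tokens and word tokens."""
    return TOKEN_RE.findall(text)


def extract_measurement(text):
    """Return (value, unit) for the first number directly followed by a
    known unit, or None if no such pair is found."""
    tokens = tokenize(text)
    for number, word in zip(tokens, tokens[1:]):
        unit = UNIT_ALIASES.get(word.lower())
        if NUMBER_RE.fullmatch(number) and unit is not None:
            return float(number), unit
    return None


for entry in ("1.5 gram of paracetamol",
              "1.5g of paracetamol",
              "1.5 grams. of paracetamol"):
    print(tokenize(entry), extract_measurement(entry))
```

All three example entries from the question come out as (1.5, 'g'). If the part-of-speech tags are still needed, the token list from tokenize can be passed to nltk.pos_tag in place of the output of nltk.word_tokenize.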
