Reputation: 342
I'm trying to extract measurements from a messy dataset. Some basic example entries would be:
I want the measurement and its unit for each entry, so the result for all of the above should be: (1.5, g)
Some other questions suggested using NLTK for this kind of task, but I'm running into trouble when doing the following:
import nltk
s1 = "1.5g of paracetamol"
s2 = "1.5 gram of paracetamol"
words_s1 = nltk.word_tokenize(s1)
words_s2 = nltk.word_tokenize(s2)
nltk.pos_tag(words_s1)
nltk.pos_tag(words_s2)
Which returns:
[('1.5g', 'CD'), ('of', 'IN'), ('paracetamol', 'NN')]
[('1.5', 'CD'), ('gram', 'NN'), ('of', 'IN'), ('paracetamol', 'NN')]
The problem is that the unit 'g' is kept as part of the CD token in the first example. How can I get the following result instead?
[('1.5', 'CD'), ('g', 'NN'), ('of', 'IN'), ('paracetamol', 'NN')]
In the real dataset the units are much more varied (mg, miligrams, kg, kgrams, ...).
Thanks!
Upvotes: 2
Views: 862
Reputation: 89567
You must tokenize the sentence yourself using nltk.regexp_tokenize, for example:
words_s1 = nltk.regexp_tokenize(s1, r'(?u)\d+(?:\.\d+)?|\w+')
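To see what this pattern does without needing NLTK installed, the same regular expression can be run through the standard library's re.findall. This is only a sketch: it assumes re.findall splits the text the same way nltk.regexp_tokenize does for a pattern with no capturing groups, which is worth checking against your NLTK version.

```python
import re

# Same pattern as above: a number with an optional decimal part,
# otherwise a run of word characters.
TOKEN_PATTERN = r'(?u)\d+(?:\.\d+)?|\w+'

s1 = "1.5g of paracetamol"
s2 = "1.5 gram of paracetamol"

# The number alternative is tried first, so "1.5g" is split into "1.5" and "g".
tokens_s1 = re.findall(TOKEN_PATTERN, s1)
tokens_s2 = re.findall(TOKEN_PATTERN, s2)

print(tokens_s1)  # ['1.5', 'g', 'of', 'paracetamol']
print(tokens_s2)  # ['1.5', 'gram', 'of', 'paracetamol']
```

The resulting token list can then be passed to nltk.pos_tag as in the question.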
Obviously, it needs to be improved to deal with more complicated cases.
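One possible direction for that improvement, if POS tags are not actually needed, is to capture the number and the unit directly and map the unit spellings to a canonical form. The sketch below uses only the standard re module; the names UNIT_ALIASES, MEASUREMENT and extract_measurement are hypothetical, and the alias table is an assumption that would have to be filled in from the real data.

```python
import re

# Hypothetical alias table: extend it with the spellings found in the real data.
UNIT_ALIASES = {
    "g": "g", "gram": "g", "grams": "g",
    "mg": "mg", "miligrams": "mg", "milligrams": "mg",
    "kg": "kg", "kgrams": "kg", "kilograms": "kg",
}

# A number (optional decimal part), optional whitespace, then a run of letters.
MEASUREMENT = re.compile(r'(\d+(?:\.\d+)?)\s*([A-Za-z]+)')

def extract_measurement(text):
    """Return (value, unit) for the first recognised measurement, else None."""
    for number, word in MEASUREMENT.findall(text):
        unit = UNIT_ALIASES.get(word.lower())
        if unit is not None:
            return float(number), unit
    return None

print(extract_measurement("1.5g of paracetamol"))      # (1.5, 'g')
print(extract_measurement("1.5 gram of paracetamol"))  # (1.5, 'g')
```

A number followed by a word that is not in the alias table is skipped, so unknown units return None rather than a wrong guess.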
Upvotes: 2