Reputation: 43
Recently I have been running into issues while splitting medical text into sentences with spaCy. Can you explain why these issues arise?
If the last word of a sentence is a single character and the sentence ends with a period, the end of the sentence isn't recognized. For example:
There was no between-treatment difference in preoperative or postoperative hemodynamics or in release of troponin I. (NO SPLIT HERE) Preoperative oral coenzyme Q(10) therapy in patients undergoing cardiac surgery increases myocardial and cardiac mitochondrial coenzyme Q(10) levels, improves mitochondrial efficiency, and increases myocardial tolerance to in vitro hypoxia-reoxygenation stress.
Another issue is with the characters +/-, which are treated as the end of a sentence. For instance, one whole sentence is split into several sentences, like below:
All of the above should be one single sentence!
Sometimes a sentence is split between a word and a special character (or between two special characters, or between a number and a word shorter than 3 characters).
The survival rates for patients receiving left ventricular assist devices (n = 68) versus patients receiving optimal medical management (n = 61) were 52% versus 28% at 1 year and 29% versus 13% at 2 years SPLITS HERE ( P = .008, log-rank test).
Thank you very much!
Upvotes: 4
Views: 548
Reputation: 15633
spaCy's English models are trained on web data, mostly things like blog posts. The average blog post looks nothing like the medical literature you're working on, so spaCy gets badly confused. This problem isn't specific to spaCy; it will happen with any statistical system built for "typical" English whose training data doesn't include medical papers.
Medical text is notorious for breaking NLP techniques that work well elsewhere, so you may want to look for something specifically tailored to it. Alternatively, you can build a small training set from your data and train a new spaCy model.
That said, the +/- issue does look strange, and might come from a tokenization issue rather than a model issue. I would recommend you file a bug report here.
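One way to check whether the +/- problem comes from tokenization is to look at the tokens directly, with no statistical model involved. A minimal sketch, assuming spaCy is installed; the helper name `token_texts` and the sample sentence are made up for illustration and are not taken from the question:

```python
import spacy

# Tokenizer only: a blank English pipeline needs no downloaded model,
# so whatever comes out here is the tokenizer's doing, not the parser's.
nlp = spacy.blank("en")


def token_texts(text):
    """Return the token strings the English tokenizer produces for `text`."""
    return [token.text for token in nlp(text)]


# Hypothetical sample; substitute a sentence from your own data.
sample = "Mean age was 61 +/- 9 years."
print(token_texts(sample))
```

If `+/-` comes out as an odd token sequence here, the tokenizer is a likely suspect. If the tokens look sensible, the bad split is more likely coming from the model.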
Upvotes: 2