Reputation: 387
Mar 9, 2011 1:22:06 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: � (U+FFFD, decimal: 65533)
Mar 9, 2011 1:22:06 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: � (U+FFFD, decimal: 65533)
Mar 9, 2011 1:22:06 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: � (U+FFFD, decimal: 65533)
Mar 9, 2011 1:22:06 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: � (U+FFFD, decimal: 65533)
Mar 9, 2011 1:22:06 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: � (U+FFFD, decimal: 65533)
Mar 9, 2011 1:22:06 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: � (U+FFFD, decimal: 65533)
Mar 9, 2011 1:22:06 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: � (U+FFFD, decimal: 65533)
These are the warnings I get when I try to assign POS tags to sentences that I read from a file. The first few sentences are processed without any "Untokenizable" warning, but after some sentences have been read the warning starts to appear. I am using v2.0 (i.e. the 2009 release) of the POS tagger with the left3words model.
Upvotes: 11
Views: 6405
Reputation: 7879
I ran into this issue as well. One way to test whether a character is tokenizable is to pass it to Character.isIdentifierIgnorable(): a character that is untokenizable will return true, while all tokenizable characters will return false.
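A minimal sketch of such a check is below. The class and method names (SuspectCharScan, isSuspect, stripSuspect) are hypothetical. Note that, per the Java documentation, Character.isIdentifierIgnorable() returns true only for non-whitespace ISO control characters and format characters. U+FFFD itself is a symbol, so the sketch tests for it explicitly; that extra test is an addition here, not part of the suggestion above. Stripping characters only hides the symptom and may drop real text, so treat it as a workaround.

```java
class SuspectCharScan {
    // True for characters this sketch treats as untokenizable:
    // identifier-ignorable characters (control/format) plus U+FFFD,
    // the Unicode replacement character. The U+FFFD test is an assumption.
    static boolean isSuspect(char c) {
        return Character.isIdentifierIgnorable(c) || c == '\uFFFD';
    }

    // Returns a copy of the line with every suspect character removed.
    static String stripSuspect(String line) {
        StringBuilder cleaned = new StringBuilder(line.length());
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (!isSuspect(c)) {
                cleaned.append(c);
            }
        }
        return cleaned.toString();
    }

    public static void main(String[] args) {
        String line = "Hello\uFFFD world";
        System.out.println(stripSuspect(line));
    }
}
```

Each line read from the file could be passed through stripSuspect() before it is handed to the tagger.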
Upvotes: 1
Reputation: 1165
If you are reading content from DOC or Portable Document Format (PDF) files, use Apache Tika to extract the text first. It might help you.
About Tika
Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. It is written in Java, but includes a command line version for use from other languages.
More information on Tika, the bug tracker, mailing lists, downloads and more are available at http://tika.apache.org/
Upvotes: 0
Reputation: 9450
I agree with Yuval that this is a character encoding problem, but the most common case is actually a file stored in a single-byte encoding such as ISO-8859-1 that the tagger is trying to read as UTF-8. See the discussion of U+FFFD on Wikipedia.
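The effect can be reproduced in a few lines. This is only an illustrative sketch: the class name ReplacementCharDemo, the helper countReplacementChars, and the sample text are all made up for the example. It encodes a short string as ISO-8859-1 and then decodes the same bytes twice, once as UTF-8 and once as ISO-8859-1, counting how many U+FFFD characters each decoding produces.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ReplacementCharDemo {
    // Decodes the bytes with the given charset and counts U+FFFD,
    // the character a decoder substitutes for bytes it cannot decode.
    static int countReplacementChars(byte[] bytes, Charset charset) {
        String text = new String(bytes, charset);
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\uFFFD') {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Sample text with two accented letters; in ISO-8859-1 each is one byte.
        byte[] latin1 = "caf\u00E9 na\u00EFve".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("as UTF-8:      "
                + countReplacementChars(latin1, StandardCharsets.UTF_8));
        System.out.println("as ISO-8859-1: "
                + countReplacementChars(latin1, StandardCharsets.ISO_8859_1));
    }
}
```

Decoding as UTF-8 yields replacement characters, while decoding with the matching charset yields none. If the input file really is ISO-8859-1, opening it with an explicit charset, for example Files.newBufferedReader(path, StandardCharsets.ISO_8859_1), before passing the text to the tagger should avoid the warnings.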
Upvotes: 8
Reputation: 20621
This looks like an encoding problem to me. Can you post the offending sentence? I couldn't find this in the documentation, but I would try checking if the file is in UTF-8 encoding.
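One way to check this, as a rough sketch, is to decode the file's bytes with a strict UTF-8 decoder that reports malformed input instead of silently replacing it. The class name Utf8Check and the method isValidUtf8 are hypothetical, and the file path is taken from the command line.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class Utf8Check {
    // Returns true if the bytes form well-formed UTF-8, false otherwise.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // The decoder hit a byte sequence that is not legal UTF-8.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path of the file to inspect.
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(isValidUtf8(bytes) ? "valid UTF-8" : "not valid UTF-8");
    }
}
```

A result of "valid UTF-8" does not prove the file was written as UTF-8 (pure ASCII passes too), but a result of "not valid UTF-8" is a strong hint that it needs to be read with a different charset.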
Upvotes: 2