KNsiva

Reputation: 387

Stanford POS tagger in Java usage

Mar 9, 2011 1:22:06 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: � (U+FFFD, decimal: 65533)
Mar 9, 2011 1:22:06 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: � (U+FFFD, decimal: 65533)
Mar 9, 2011 1:22:06 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: � (U+FFFD, decimal: 65533)
Mar 9, 2011 1:22:06 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: � (U+FFFD, decimal: 65533)
Mar 9, 2011 1:22:06 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: � (U+FFFD, decimal: 65533)
Mar 9, 2011 1:22:06 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: � (U+FFFD, decimal: 65533)
Mar 9, 2011 1:22:06 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: � (U+FFFD, decimal: 65533)

I get these warnings when I assign POS tags to sentences that I read from a file. The first few sentences are tagged without any warning, but after some sentences have been read, the "Untokenizable" warning starts to appear. I am using v2.0 (2009) of the POS tagger with the left3words model.

Upvotes: 11

Views: 6405

Answers (4)

Adam_G

Reputation: 7879

I ran into this issue as well. One way to test whether a character is tokenizable is to pass it to Character.isIdentifierIgnorable(). In my experience, an untokenizable character returns true, while tokenizable characters return false.
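A minimal sketch of how such a check could be used to clean text before tagging. The class and method names are made up for illustration. Note that Character.isIdentifierIgnorable() is documented to return true only for non-whitespace ISO control characters and for FORMAT-category characters; U+FFFD, the character in the log above, is a symbol and returns false, so the sketch assumes it should be dropped as well and tests for it explicitly.

```java
// Hypothetical helper, not part of the Stanford tagger.
class IgnorableFilter {

    /** Removes identifier-ignorable code points and U+FFFD from the text. */
    static String stripIgnorable(String text) {
        StringBuilder cleaned = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            i += Character.charCount(codePoint);
            // Control and format characters, plus the replacement character
            // (assumption: dropping U+FFFD is acceptable for your data).
            if (Character.isIdentifierIgnorable(codePoint) || codePoint == 0xFFFD) {
                continue;
            }
            cleaned.appendCodePoint(codePoint);
        }
        return cleaned.toString();
    }

    public static void main(String[] args) {
        String dirty = "The\uFEFF quick\uFFFD fox";
        System.out.println(stripIgnorable(dirty));
    }
}
```

Dropping characters only hides the symptom. If U+FFFD appears because the file was decoded with the wrong charset, the original characters are already lost by the time this filter runs.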

Upvotes: 1

Rahul Kulhari

Reputation: 1165

If you are reading content from a DOC or PDF (Portable Document Format) file, use Apache Tika to extract the text. It might help you.

Apache Tika

About Tika

Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. It is written in Java, but includes a command-line version for use from other languages.

More information on Tika, including the bug tracker, mailing lists, and downloads, is available at http://tika.apache.org/

Upvotes: 0

Christopher Manning

Reputation: 9450

I agree with Yuval that this is a character encoding problem, but the most common cause is actually a file in a single-byte encoding such as ISO-8859-1 that the tagger is trying to read as UTF-8. See the discussion of U+FFFD on Wikipedia.
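To illustrate the mismatch, here is a small sketch using only the Java standard library (the class and method names are invented for this example). It encodes a string as ISO-8859-1 and then decodes the same bytes two ways: as UTF-8, where the byte for "é" is malformed and gets replaced by U+FFFD, and as ISO-8859-1, where the text comes back intact.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the Stanford tagger.
class EncodingMismatchDemo {

    /** Decodes bytes as UTF-8; malformed sequences are replaced with U+FFFD. */
    static String decodeAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    /** Decodes bytes as ISO-8859-1; every byte maps to exactly one character. */
    static String decodeAsLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Simulate a file that was saved in ISO-8859-1.
        byte[] raw = "caf\u00e9 au lait".getBytes(StandardCharsets.ISO_8859_1);

        String wrong = decodeAsUtf8(raw);
        String right = decodeAsLatin1(raw);

        System.out.println("Read as UTF-8 contains U+FFFD: " + (wrong.indexOf('\uFFFD') >= 0));
        System.out.println("Read as ISO-8859-1 contains U+FFFD: " + (right.indexOf('\uFFFD') >= 0));
    }
}
```

If this is what is happening, the fix is to read the file with the charset it was actually written in, for example by wrapping the input stream in an InputStreamReader constructed with that charset, rather than filtering the replacement characters afterwards.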

Upvotes: 8

Yuval F

Reputation: 20621

This looks like an encoding problem to me. Can you post the offending sentence? I couldn't find this in the documentation, but I would try checking if the file is in UTF-8 encoding.
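One way to check whether the file is valid UTF-8 is sketched below, using the standard CharsetDecoder configured to report errors instead of silently replacing them. The class name and the command-line argument are assumptions for illustration only.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical utility, not part of the Stanford tagger.
class Utf8Check {

    /** Returns true if the bytes form a well-formed UTF-8 sequence. */
    static boolean isValidUtf8(byte[] raw) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            // Throws instead of substituting U+FFFD when the input is malformed.
            decoder.decode(ByteBuffer.wrap(raw));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java Utf8Check <file>");
            return;
        }
        byte[] raw = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(args[0] + (isValidUtf8(raw) ? " is" : " is NOT") + " valid UTF-8");
    }
}
```

A file that passes this check can still contain a literal U+FFFD that was written into it by an earlier, lossy conversion, so a passing result does not rule out an encoding problem further upstream.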

Upvotes: 2
