amibar

Reputation: 31

Stanford word segmenter

I'm using the Stanford Word Segmenter, but I have a problem with it.

I type the command:

$ C:\Users\toshiba\workspace\SegDemo\stanford-segmenter-2013-06-20>java -cp seg.jar;stanford-segmenter-3.2.0-javadoc.jar;stanford-segmenter-3.2.0-sources.jar -mx1g edu.stanford.nlp.international.arabic.process.ArabicSegmenter -loadClassifier data/arabic-segmenter-atbtrain.ser.gz -textFile phrase.txt > phrase.txt.segmented 

And I have the following process:

Loaded ArabicTokenizer with options: null
loadClassifier=data/arabic-segmenter-atbtrain.ser.gz
textFile=phrase.txt
featureFactory=edu.stanford.nlp.international.arabic.process.ArabicSegmenterFeatureFactory
loadClassifier=data/arabic-segmenter-atbtrain.ser.gz
textFile=phrase.txt
featureFactory=edu.stanford.nlp.international.arabic.process.ArabicSegmenterFeatureFactory
Loading classifier from C:\Users\toshiba\workspace\SegDemo\stanford-segmenter-2013-06-20\data\arabic-segmenter-atbtrain.ser.gz ... done [1,2 sec].
Untokenizable: ?
Done! Processed input text at 475,13 input characters/second

I don't understand the "Untokenizable: ?" message.

Should the sentence be transliterated before it is segmented?

Upvotes: 1

Views: 1788

Answers (2)

dbl

Reputation: 163

I haven't tried this with the segmenter, but I've seen this with the tokenizer from time to time. Using "-options untokenizable=noneKeep" works for PTBTokenizer; maybe it will work for the segmenter as well.

Here's what http://nlp.stanford.edu/software/tokenizer.shtml has to say about the untokenizable options:

untokenizable: What to do with untokenizable characters (ones not known to the tokenizer). Six options combining whether to log a warning for none, the first, or all, and whether to delete them or to include them as single character tokens in the output: noneDelete, firstDelete, allDelete, noneKeep, firstKeep, allKeep. The default is "firstDelete".
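For what it's worth, here is a minimal sketch of passing the same option string programmatically. The class name and sample text are made up; the three-argument PTBTokenizer constructor and CoreLabelTokenFactory are the standard edu.stanford.nlp.process API as of the 2013 releases, but I haven't verified this against the segmenter jar:

import java.io.StringReader;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;

public class NoneKeepDemo {
    public static void main(String[] args) {
        // "untokenizable=noneKeep": log no warnings and keep unknown
        // characters as single-character tokens instead of deleting them.
        PTBTokenizer<CoreLabel> tokenizer = new PTBTokenizer<>(
                new StringReader("an odd character: \u20AA"),
                new CoreLabelTokenFactory(),
                "untokenizable=noneKeep");
        while (tokenizer.hasNext()) {
            System.out.println(tokenizer.next().word());
        }
    }
}

On the command line the equivalent is something like java -cp seg.jar edu.stanford.nlp.process.PTBTokenizer -options untokenizable=noneKeep phrase.txt.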

Upvotes: 0

mbatchkarov

Reputation: 16039

I often get the same warning, for example:

WARNING: Untokenizable: ₪ (U+20AA, decimal: 8362)

I have two theories as to what causes this:

  1. Somewhere in the text there is a character that cannot be represented in the current encoding (Stanford uses UTF-8 by default, but you can change that with the -encoding flag).
  2. Stanford does not know how to tokenise a word containing a very unusual character.

In either case, this is nothing to worry about. If you are only getting one warning for your whole input data, then the worst thing that can happen is the tokenizer might ignore a small portion of a sentence.
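If you want to track down which character triggered the warning, a quick way is to scan the input yourself. The sketch below is plain standard-library Java and has nothing to do with Stanford's code; the class name and argument layout are made up for illustration. It reads a UTF-8 file and prints, in the same style as the warning, every character that the encoding named in the second argument cannot represent:

import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.file.Files;
import java.nio.file.Paths;

public class FindUnencodable {
    public static void main(String[] args) throws Exception {
        // args[0] = input file (assumed UTF-8), args[1] = encoding to test, e.g. "ISO-8859-1"
        String text = new String(Files.readAllBytes(Paths.get(args[0])), "UTF-8");
        CharsetEncoder encoder = Charset.forName(args[1]).newEncoder();
        for (int i = 0; i < text.length(); i += Character.charCount(text.codePointAt(i))) {
            int cp = text.codePointAt(i);
            String ch = new String(Character.toChars(cp));
            if (!encoder.canEncode(ch)) {
                // Same style as the Stanford warning: character, code point, decimal value
                System.out.printf("Unencodable: %s (U+%04X, decimal: %d)%n", ch, cp, cp);
            }
        }
    }
}

Run it as java FindUnencodable phrase.txt UTF-8 to check whether theory 1 applies to your input.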

As an aside, Joel Spolsky's article on Unicode ("The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets") is a very good starting place if you want to know more about character encodings.

Upvotes: 1
