Reputation: 31
I'm using the Stanford Word Segmenter, but I have a problem with it.
I type the following command:
C:\Users\toshiba\workspace\SegDemo\stanford-segmenter-2013-06-20>java -cp seg.jar;stanford-segmenter-3.2.0-javadoc.jar;stanford-segmenter-3.2.0-sources.jar -mx1g edu.stanford.nlp.international.arabic.process.ArabicSegmenter -loadClassifier data/arabic-segmenter-atbtrain.ser.gz -textFile phrase.txt > phrase.txt.segmented
And I have the following process:
Loaded ArabicTokenizer with options: null
loadClassifier=data/arabic-segmenter-atbtrain.ser.gz
textFile=phrase.txt
featureFactory=edu.stanford.nlp.international.arabic.process.ArabicSegmenterFeatureFactory
loadClassifier=data/arabic-segmenter-atbtrain.ser.gz
textFile=phrase.txt
featureFactory=edu.stanford.nlp.international.arabic.process.ArabicSegmenterFeatureFactory
Loading classifier from C:\Users\toshiba\workspace\SegDemo\stanford-segmenter-2013-06-20\data\arabic-segmenter-atbtrain.ser.gz ... done [1,2 sec].
Untokenizable: ?
Done! Processed input text at 475,13 input characters/second
I don't understand "Untokenizable: ?".
Should the sentence be transliterated before segmentation?
Upvotes: 1
Views: 1788
Reputation: 163
I haven't tried this with the segmenter, but I've seen this with the tokenizer from time to time. Using "-options untokenizable=noneKeep" works for PTBTokenizer; maybe it will work for the segmenter as well.
Here's what http://nlp.stanford.edu/software/tokenizer.shtml has to say about the untokenizable options:
untokenizable: What to do with untokenizable characters (ones not known to the tokenizer). Six options combining whether to log a warning for none, the first, or all, and whether to delete them or to include them as single character tokens in the output: noneDelete, firstDelete, allDelete, noneKeep, firstKeep, allKeep. The default is "firstDelete".
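If that option does carry over, your command would become something like the following (a sketch only: "-options untokenizable=noneKeep" is documented for PTBTokenizer, and it's an assumption that ArabicSegmenter accepts it too; I've also trimmed the classpath to seg.jar, since the javadoc and sources jars aren't needed to run it):
java -cp seg.jar -mx1g edu.stanford.nlp.international.arabic.process.ArabicSegmenter -loadClassifier data/arabic-segmenter-atbtrain.ser.gz -textFile phrase.txt -options untokenizable=noneKeep > phrase.txt.segmented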
Upvotes: 0
Reputation: 16039
I often get the same warning, for example:
WARNING: Untokenizable: ₪ (U+20AA, decimal: 8362)
I have two theories as to what causes this:
- the input actually contains characters the tokenizer does not recognize, or
- the input file is in a different encoding than the tokenizer expects (which can be changed with the -encoding flag); see the example command after this list.
In either case, this is nothing to worry about. If you are only getting one warning for your whole input data, then the worst thing that can happen is that the tokenizer ignores a small portion of a sentence.
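If the second theory is the culprit, you could try declaring the encoding explicitly. A sketch, assuming your phrase.txt is UTF-8 (the -encoding flag is the one mentioned above; substitute whatever encoding your file actually uses):
java -cp seg.jar -mx1g edu.stanford.nlp.international.arabic.process.ArabicSegmenter -loadClassifier data/arabic-segmenter-atbtrain.ser.gz -textFile phrase.txt -encoding UTF-8 > phrase.txt.segmented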
As an aside, Joel Spolsky's article on Unicode is a very good starting place if you want to know more about character encodings.
Upvotes: 1