Reputation: 23
I am new to WEKA and have a few questions about it. I followed this tutorial (Named Entity Recognition using WEKA),
but I am really confused and do not know how to proceed.
For example, in my .ARFF file:
@attribute text string
@attribute tag {CC, CD, DT, EX, FW, IN, JJ, JJR, JJS, LS, MD, NN, NNS, NNP, NNPS, PDT, POS, PRP, PRP$, RB, RBR, RBS, RP, SYM, TO, UH, VB, VBD, VBG, VBN, VBP, VBZ, WDT, WP, WP$, WRB, ',', ., :}
@attribute capital {Y, N}
@attribute chunked {B-NP, I-NP, B-VP, I-VP, B-PP, I-PP, B-ADJP, B-ADVP, B-SBAR, B-PRT, O-Punctuation}
@attribute @@class@@ {B-PER, I-PER, B-ORG, I-ORG, B-NUM, I-NUM, O, B-LOC, I-LOC}
@data
'Wanna',NNP,Y,B-NP,O
'be',VB,N,B-VP,O
'like',IN,N,B-PP,O
'New',NNP,Y,B-NP,B-LOC
'York',NNP,Y,I-NP,I-LOC
'?',.,N,O-Punctuation,O
So, when I applied the string filter, it tokenized the string into single words, but what I want is to tokenize/filter the string by phrase. For example, I want to extract the phrase "New York" rather than "New" and "York", based on the chunked attribute.
"B-NP" marks the start of a phrase and "I-NP" marks its continuation (the middle or end of the phrase).
Also, how can I map B-PER and I-PER to the single class name PERSON in the results?
               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
               0         0.021     0           0        0           0.768      B-PER
               1         0.084     0.333       1        0.5         0.963      I-PER
               0.167     0.054     0.167       0.167    0.167       0.313      B-ORG
               0         0         0           0        0           0.964      I-ORG
               0         0         0           0        0           0.281      B-NUM
               0         0         0           0        0           0.148      I-NUM
               0.972     0.074     0.972       0.972    0.972       0.949      O
               0.875     0         1           0.875    0.933       0.977      B-LOC
               0         0         0           0        0           0.907      I-LOC
Weighted Avg.  0.828     0.061     0.811       0.828    0.813       0.894
Upvotes: 2
Views: 1477
Reputation: 750
In my opinion, WEKA is not (currently) the best machine learning software for NER. As far as I know, WEKA classifies sets of independent examples; for NER this may be done either:
In both cases, contiguity is not taken into account, which is a real problem. As far as I know, the same holds for R (?). This is why "sequence labelling" tasks (NER, morpho-syntax, syntax and dependencies) are usually done with software that determines a token's category from the current word, but also from the previous word, the next word, etc., and that can output single tokens as well as multi-token expressions or more complicated structures.
For NER, CRFs are currently the usual choice, see:
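To make the point about context concrete, here is a minimal sketch in plain Python (not WEKA, and not any particular NER library). It shows two things: building features for a token from its neighbours, and merging B-/I- labelled tokens into multi-token phrases such as "New York". The function names `context_features` and `merge_bio` and the `<PAD>` placeholder are made up for illustration; the tokens and labels are taken from the question.

```python
from typing import Dict, List, Tuple


def context_features(tokens: List[str], pos_tags: List[str], i: int) -> Dict[str, str]:
    """Describe token i using itself plus its previous and next neighbour."""

    def at(seq: List[str], j: int) -> str:
        # Positions outside the sentence get a placeholder value (an assumption).
        return seq[j] if 0 <= j < len(seq) else "<PAD>"

    return {
        "word": tokens[i],
        "pos": pos_tags[i],
        "capital": "Y" if tokens[i][:1].isupper() else "N",
        "prev_word": at(tokens, i - 1),
        "prev_pos": at(pos_tags, i - 1),
        "next_word": at(tokens, i + 1),
        "next_pos": at(pos_tags, i + 1),
    }


def merge_bio(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Group B-X / I-X labelled tokens into (phrase, X) pairs; 'O' tokens are dropped."""
    spans: List[Tuple[str, str]] = []
    words: List[str] = []
    kind = None
    for token, label in zip(tokens, labels):
        prefix, _, label_kind = label.partition("-")
        if prefix == "I" and words and label_kind == kind:
            words.append(token)  # continue the phrase that is currently open
            continue
        if words:  # any other label closes the open phrase
            spans.append((" ".join(words), kind))
            words, kind = [], None
        if prefix in ("B", "I"):  # a stray I- tag is treated as a phrase start
            words, kind = [token], label_kind
    if words:
        spans.append((" ".join(words), kind))
    return spans


tokens = ["Wanna", "be", "like", "New", "York", "?"]
pos_tags = ["NNP", "VB", "IN", "NNP", "NNP", "."]
labels = ["O", "O", "O", "B-LOC", "I-LOC", "O"]

print(context_features(tokens, pos_tags, 3))
print(merge_bio(tokens, labels))
```

Because `merge_bio` keeps only the part of the label after the prefix, B-LOC and I-LOC both end up under a single LOC type; the same would apply to B-PER and I-PER. The same grouping would also work on the chunked column (B-NP, I-NP) if you want noun phrases instead of entities.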
Upvotes: 3