Reputation: 609
I want to apply the StringToWordVector filter to my dataset in the WEKA application. I set the wordsToKeep parameter to 50, but the filter produces 78 words instead of 50. How can I make it keep only 50 words?
My dataset: http://www.dt.fee.unicamp.br/~tiago/smsspamcollection
Upvotes: 1
Views: 73
Reputation: 66805
The -W option restricts the number of words to keep per class, so with 2 classes, setting -W 50 gives you a limit of 100.
From the source:
public String wordsToKeepTipText() {
    return "The number of words (per class if there is a class attribute "+
        "assigned) to attempt to keep.";
}
Furthermore, according to the source, it is not a strict constraint: it only determines where the sorted list of occurrence counts is pruned, and this can be altered:
// sort the array
sortArray(array);
if (array.length < m_WordsToKeep) {
    // if there aren't enough words, set the threshold to
    // minFreq
    prune[z] = m_minTermFreq;
} else {
    // otherwise set it to be at least minFreq
    prune[z] = Math.max(m_minTermFreq,
        array[array.length - m_WordsToKeep]);
}
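To see why the limit is soft, here is a minimal standalone sketch of the threshold logic quoted above. It does not use WEKA. The class name PruneThresholdSketch and both method names are hypothetical. The sketch assumes that a word is kept whenever its count is at least the threshold, and that wordsToKeep is positive. Under that assumption, words tied at the threshold count are all kept, which could explain ending up with more words than requested.

```java
import java.util.Arrays;

public class PruneThresholdSketch {

    // Mirrors the quoted snippet: returns the minimum count a word
    // needs in order to survive pruning (wordsToKeep assumed > 0).
    public static int pruneThreshold(int[] counts, int wordsToKeep, int minTermFreq) {
        int[] sorted = counts.clone();
        Arrays.sort(sorted); // ascending order, standing in for sortArray(array)
        if (sorted.length < wordsToKeep) {
            // not enough words: fall back to the minimum frequency
            return minTermFreq;
        }
        // otherwise take the count of the wordsToKeep-th most frequent word,
        // but never go below the minimum frequency
        return Math.max(minTermFreq, sorted[sorted.length - wordsToKeep]);
    }

    // Assumption: every word whose count is >= the threshold is kept.
    public static int countKept(int[] counts, int wordsToKeep, int minTermFreq) {
        int threshold = pruneThreshold(counts, wordsToKeep, minTermFreq);
        int kept = 0;
        for (int c : counts) {
            if (c >= threshold) {
                kept++;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up word counts; three words are tied at a count of 5.
        int[] counts = {9, 7, 5, 5, 5, 2, 1};
        System.out.println("threshold = " + pruneThreshold(counts, 3, 1));
        System.out.println("kept      = " + countKept(counts, 3, 1));
    }
}
```

With these made-up counts, asking for 3 words puts the threshold at 5, and all five words with a count of 5 or more pass it.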
Upvotes: 2