Ozturk

Reputation: 609

Why does StringToWordVector in WEKA keep a different number of words than wordsToKeep specifies?

I want to apply the StringToWordVector filter to my dataset in the WEKA application. I set the wordsToKeep parameter to 50, but the filter produces 78 words instead of 50. How can I get exactly 50 words?

My data set: http://www.dt.fee.unicamp.br/~tiago/smsspamcollection

Upvotes: 1

Views: 73

Answers (1)

lejlot

Reputation: 66805

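The -W option restricts the number of words to keep per class, so with 2 classes, setting -W 50 gives you a limit of 100 rather than 50. A word kept for both classes only appears once in the dictionary, which is why the total (78 here) can land anywhere between 50 and 100.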

From the source:

public String wordsToKeepTipText() {
  return "The number of words (per class if there is a class attribute " +
      "assigned) to attempt to keep.";
}
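To make the per-class behaviour concrete, here is a minimal sketch in plain Java (it does not call WEKA). The class name, method names, and word counts are all hypothetical, and it assumes the final dictionary is simply the union of the words kept for each class:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class PerClassDictionarySketch {

    // Return the wordsToKeep most frequent words for a single class
    // (ties broken alphabetically; this tie rule is an assumption of the sketch).
    static List<String> topWords(Map<String, Integer> counts, int wordsToKeep) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(wordsToKeep, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    // Union of the per-class selections: a word kept for several classes
    // is stored only once.
    static Set<String> dictionary(List<Map<String, Integer>> perClassCounts, int wordsToKeep) {
        Set<String> dict = new TreeSet<>();
        for (Map<String, Integer> counts : perClassCounts) {
            dict.addAll(topWords(counts, wordsToKeep));
        }
        return dict;
    }

    public static void main(String[] args) {
        // Made-up counts for two classes.
        Map<String, Integer> spam = Map.of("free", 9, "call", 7, "now", 3);
        Map<String, Integer> ham = Map.of("call", 8, "home", 6, "now", 2);

        // wordsToKeep = 2 per class: {free, call} and {call, home}.
        Set<String> dict = dictionary(List.of(spam, ham), 2);
        System.out.println(dict.size()); // prints 3: more than 2, fewer than 4
    }
}
```

With a limit of 2 per class the sketch ends up with 3 words, the same effect as getting 78 words from -W 50 on a two-class dataset.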

Furthermore, according to the source, it is not a strict constraint: it only determines where the sorted occurrences list is pruned, so the number of words actually kept can differ from the value you set.

// sort the array
sortArray(array);
if (array.length < m_WordsToKeep) {
  // if there aren't enough words, set the threshold to
  // minFreq
  prune[z] = m_minTermFreq;
} else {
  // otherwise set it to be at least minFreq
  prune[z] = Math.max(m_minTermFreq,
      array[array.length - m_WordsToKeep]);
}
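The effect of that threshold can be sketched in standalone Java. The class and method names below are hypothetical, the counts are made up, and the sketch assumes that every word whose count is at least the threshold is kept, which is how ties at the cut-off would push the total past wordsToKeep:

```java
import java.util.Arrays;

public class PruneThresholdSketch {

    // Mirrors the quoted snippet for one class: pick the count found at
    // position (length - wordsToKeep) of the ascending-sorted array,
    // but never go below minTermFreq.
    static int threshold(int[] counts, int wordsToKeep, int minTermFreq) {
        int[] sorted = counts.clone();
        Arrays.sort(sorted); // ascending order
        if (sorted.length < wordsToKeep) {
            return minTermFreq;
        }
        return Math.max(minTermFreq, sorted[sorted.length - wordsToKeep]);
    }

    // Assumption of this sketch: a word survives if count >= threshold.
    static int countKept(int[] counts, int wordsToKeep, int minTermFreq) {
        int t = threshold(counts, wordsToKeep, minTermFreq);
        int kept = 0;
        for (int c : counts) {
            if (c >= t) {
                kept++;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Three words are tied at a count of 5, right at the cut-off.
        int[] counts = {9, 7, 5, 5, 5, 2, 1};
        System.out.println(threshold(counts, 3, 1)); // prints 5
        System.out.println(countKept(counts, 3, 1)); // prints 5, although 3 were requested
    }
}
```

Because the limit is applied as a frequency threshold rather than a hard count, getting exactly 50 attributes would likely need a separate attribute selection step after the filter.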

Upvotes: 2
