Reputation: 11
I am implementing sentiment analysis. When I tokenized my dataset.arff to use it as training data, I noticed that the "Positive" label is missing from the output.
The head of the dataset.arff file:
@relation SentimentAnalysis
@attribute text string
@attribute classValues {Positive,Negative,Neutral,'Very positive','Very negative'}
@data
'some text',Positive
'some text',Negative
'some text',Neutral
[...]
The tokenized data:
@relation 'SentimentAnalysis-weka.filters.unsupervised.attribute.StringToWordVector-R1-W5000-prune-rate-1.0-N0-stemmerweka.core.stemmers.NullStemmer-stopwords-handlerweka.core.stopwords.Null-M1-tokenizerweka.core.tokenizers.WordTokenizer -delimiters \" \\r\\n\\t.,;:\\\'\\\"()?!\"'
@attribute classValues {Positive,Negative,Neutral,'Very positive','Very negative'}
Every word in the text then appears as an @attribute of type numeric, followed by the @data section:
{0 Negative,20 1,220 1,228 [...] }
{0 Neutral,22 1,169 [...] }
{22 1,169 1,272 [...] }
########################################
These are just a few example lines; I have many more like them. I do not understand why Positive is missing.
Here is my code for the tokenizer:
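import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffLoader;
import weka.core.converters.ArffSaver;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public static void Tokenizer() {
    ArffLoader loader = new ArffLoader();
    String filename = "dataset.arff";
    File file = new File(filename);
    // Only proceed if the input file exists.
    if (file.exists()) {
        try {
            loader.setFile(file);
            Instances data = loader.getDataSet();
            // Use the last attribute as the class if none is set yet.
            if (data.classIndex() == -1) {
                data.setClassIndex(data.numAttributes() - 1);
            }
            // -R 1: convert only the first attribute; -W 5000: keep about 5000 words;
            // -prune-rate -1.0: no periodic pruning of the dictionary.
            String[] options = new String[] {"-R", "1", "-W", "5000", "-prune-rate", "-1.0"};
            StringToWordVector filter = new StringToWordVector();
            filter.setOptions(options);
            filter.setInputFormat(data);
            Instances filteredData = Filter.useFilter(data, filter);
            // Write the filtered (tokenized) instances to a new ARFF file.
            File tokenizedFile = new File("tokenizeDataSet.arff");
            ArffSaver saver = new ArffSaver();
            saver.setInstances(filteredData);
            saver.setFile(tokenizedFile);
            saver.writeBatch();
            System.out.println("tokenized text saved properly");
        } catch (Exception e) {
            System.out.println("Error in reloading the Arff file:" + e.getMessage());
            e.printStackTrace();
        }
    }
}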
I was expecting to have the Positive label in the tokenized dataset too!
Upvotes: 0
Views: 79
Reputation: 2608
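They weren't lost. The sparse ARFF format omits every value that is stored internally as 0, and the first label of a nominal attribute (here Positive) has index 0, so it is simply not written out. A row with no entry for attribute 0 therefore has the class Positive. See the Weka wiki on the sparse ARFF format.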
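To make the suppression concrete, here is a minimal sketch in plain Java of how a sparse row could be written. It is not Weka's actual code: the class `SparseArffSketch`, the method `toSparseRow`, and the tiny four-attribute layout are all made up for illustration. The only assumption taken from the format is that nominal values are stored as the index of their label and that zero values are skipped.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not how Weka is implemented internally.
public class SparseArffSketch {

    // Labels of the nominal class attribute, in declaration order.
    // "Positive" is first, so its internal index is 0.
    static final String[] CLASS_LABELS =
            {"Positive", "Negative", "Neutral", "Very positive", "Very negative"};

    // labels[i] holds the label list for a nominal attribute, or null for a numeric one.
    static String toSparseRow(double[] values, String[][] labels) {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < values.length; i++) {
            if (values[i] == 0.0) {
                continue; // sparse format: zero values are not written at all
            }
            String text;
            if (labels[i] != null) {
                // Nominal attribute: the stored value is the index of the label.
                String label = labels[i][(int) values[i]];
                text = label.contains(" ") ? "'" + label + "'" : label;
            } else if (values[i] == Math.rint(values[i])) {
                text = String.valueOf((long) values[i]);
            } else {
                text = String.valueOf(values[i]);
            }
            parts.add(i + " " + text);
        }
        return "{" + String.join(",", parts) + "}";
    }

    public static void main(String[] args) {
        // Attribute 0 is the class; attributes 1..3 are numeric word counts.
        String[][] labels = {CLASS_LABELS, null, null, null};
        System.out.println(toSparseRow(new double[] {1, 0, 1, 0}, labels)); // {0 Negative,2 1}
        System.out.println(toSparseRow(new double[] {0, 0, 1, 1}, labels)); // {2 1,3 1}
    }
}
```

The second row has the class Positive (index 0), which is why no `0 ...` entry shows up, just like the `{22 1,169 1,272 [...] }` line in the question. If you would rather have every value written out, Weka ships a `SparseToNonSparse` filter (in `weka.filters.unsupervised.instance`) that you could apply to the filtered data before saving; I have not tested that against this dataset.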
Upvotes: 0