Lorenzo Pombini

Reputation: 11

Missing labels in tokenized data set

I am implementing sentiment analysis. When I tokenized my dataset.arff to use it as training data, I noticed that the "Positive" label is missing from the output.

The head of the dataset.arff file:

@relation SentimentAnalysis

@attribute text string
@attribute classValues {Positive,Negative,Neutral,'Very positive','Very negative'}

@data
'some text',Positive
'some text',Negative
'some text',Neutral
[...]

The tokenized data:


@relation 'SentimentAnalysis-weka.filters.unsupervised.attribute.StringToWordVector-R1-W5000-prune-rate-1.0-N0-stemmerweka.core.stemmers.NullStemmer-stopwords-handlerweka.core.stopwords.Null-M1-tokenizerweka.core.tokenizers.WordTokenizer -delimiters \" \\r\\n\\t.,;:\\\'\\\"()?!\"'

@attribute classValues {Positive,Negative,Neutral,'Very positive','Very negative'}

All the words in the text are declared as numeric @attribute entries.

Then I have the @data part:

{0 Negative,20 1,220 1,228 [...] }
{0 Neutral,22 1,169 [...] }
{22 1,169 1,272 [...] }

These are just a few example lines; I have many more like them. I do not understand why Positive is missing.

Here is my code for the tokenizer:

import java.io.File;

import weka.core.Instances;
import weka.core.converters.ArffLoader;
import weka.core.converters.ArffSaver;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public static void Tokenizer(){

    ArffLoader loader = new ArffLoader();
    String filename = "dataset.arff";
    File file = new File(filename);

    // Only proceed if the input file exists
    if (file.exists()){
        try {
            loader.setFile(file);
            Instances data = loader.getDataSet();
            // Use the last attribute (classValues) as the class if none is set
            if (data.classIndex() == -1){
                data.setClassIndex(data.numAttributes() - 1);
            }

            // -R 1 converts the first attribute (text); -W 5000 is the
            // approximate number of words to keep
            String[] options = new String[] {"-R", "1", "-W", "5000", "-prune-rate", "-1.0"};
            StringToWordVector filter = new StringToWordVector();
            filter.setOptions(options);
            filter.setInputFormat(data);

            Instances filteredData = Filter.useFilter(data, filter);

            File tokenizedFile = new File("tokenizeDataSet.arff");

            // Write the filtered (sparse) instances to a new ARFF file
            ArffSaver saver = new ArffSaver();
            saver.setInstances(filteredData);
            saver.setFile(tokenizedFile);
            saver.writeBatch();

            System.out.println("tokenized text saved properly");
        } catch (Exception e) {
            System.out.println("Error in reloading the Arff file:" + e.getMessage());
            e.printStackTrace();
        }
    }
}

I was expecting to see the Positive label in the tokenized data set too!

Upvotes: 0

Views: 79

Answers (1)

fracpete

Reputation: 2608

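The labels weren't lost. The sparse ARFF format omits every value that equals 0, and the first declared label of a nominal attribute (here Positive) is stored internally as index 0, so it is never written out. A row with no entry for attribute 0, such as {22 1,169 1,272 [...] }, therefore has the class Positive. See the Weka wiki on the sparse ARFF format.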
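
To illustrate how a sparse row can be read back, here is a minimal sketch in plain Java with no Weka dependency. It is not Weka's actual parser: the class name SparseRowDemo, the method classLabelOf and the shortened sample rows are made up for this example, and it assumes that values contain no commas. If the class attribute has no entry in the row, the first declared label is returned.

```java
import java.util.List;

public class SparseRowDemo {

    /**
     * Returns the class label of one sparse ARFF data row (hypothetical helper).
     * If the row has no entry for classIndex, the value is the implicit 0,
     * which for a nominal attribute means the first declared label.
     * Simplification: values must not contain commas; quotes are not stripped.
     */
    public static String classLabelOf(String sparseRow, int classIndex, List<String> labels) {
        String body = sparseRow.trim();
        // Strip the surrounding braces
        body = body.substring(1, body.length() - 1).trim();
        if (!body.isEmpty()) {
            for (String entry : body.split(",")) {
                // Each entry is "<attribute index> <value>"
                String[] parts = entry.trim().split("\\s+", 2);
                if (Integer.parseInt(parts[0]) == classIndex) {
                    return parts[1];
                }
            }
        }
        // No explicit entry: fall back to the label stored as index 0
        return labels.get(0);
    }

    public static void main(String[] args) {
        List<String> labels = List.of(
                "Positive", "Negative", "Neutral", "Very positive", "Very negative");
        // Illustrative rows, shortened from the kind of output shown in the question
        System.out.println(classLabelOf("{0 Negative,20 1,220 1}", 0, labels)); // prints Negative
        System.out.println(classLabelOf("{22 1,169 1}", 0, labels));            // prints Positive
    }
}
```

If a dense file with every label written out is preferred, Weka also provides the weka.filters.unsupervised.instance.SparseToNonSparse filter, which could be applied to the filtered data before saving.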

Upvotes: 0
