Is punctuation kept in a bag of words?

I'm creating a bag-of-words module from scratch, and I'm not sure whether it is best practice to remove punctuation in this approach. Consider the sentence:

I've been "DMX world center" for long time ago.Are u?

Question: For the bag of words, should I consider

In short, should I remove all the punctuation marks when getting distinct words?

Thanks in advance

Update: This is the code I have implemented.

Sample text : ham , im .. On the snowboarding trip. I was wondering if your planning to get everyone together befor we go..a meet and greet kind of affair? Cheers,

   HashSet<String> bagOfWords = new HashSet<String>();
   BufferedReader reader = new BufferedReader(new FileReader(path));
   while (reader.ready()) {
       String msg = reader.readLine().split("\t", 2)[1].toLowerCase(); // I take only the 2nd part; the 1st part indicates whether the message is spam or ham
       String[] words = msg.split("[\\s+\n.\t!?+,]"); // this is the regex that I've used to split words
       for (String word : words) {
           bagOfWords.add(word);
       }
   }

Upvotes: 1


Answers (1)

Woody

Reputation: 939

Try replacing your loop with this:

   while (reader.ready()) {
       String msg = reader.readLine().split("\t", 2)[1].toLowerCase(); // take only the 2nd part; the 1st part indicates whether the message is spam or ham
       String[] words = msg.split("[\\s+\n.\t!?+,]"); // the regex used to split words
       for (String word : words) {
           // Remove the special characters you mentioned. Inside [...], !-+ is a range covering ! " # $ % & ' ( ) * +
           String cleaned = word.replaceAll("[!-+.^:,\"?]", "");
           if (!cleaned.isEmpty()) { // skip the empty strings split() produces between adjacent delimiters
               bagOfWords.add(cleaned);
           }
       }
   }
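
For reference, here is a minimal, self-contained sketch of the same idea applied to the example sentence from the question. The class name `BagOfWordsSketch` and the method `distinctWords` are hypothetical names chosen for illustration. The sketch also makes one assumption that is not in the question: it keeps apostrophes so that contractions such as "I've" stay a single token. It replaces punctuation with a space before splitting, so that "ago.Are" becomes two words instead of being glued together.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class BagOfWordsSketch {

    // Hypothetical helper: returns the distinct lowercase words of one message.
    public static Set<String> distinctWords(String message) {
        Set<String> bag = new LinkedHashSet<>(); // keeps first-seen order for readable output

        // Assumption: every punctuation character except the apostrophe is a separator.
        // [\p{Punct}&&[^']] is a character-class intersection: punctuation minus '.
        String normalized = message.toLowerCase().replaceAll("[\\p{Punct}&&[^']]+", " ");

        // Split on runs of whitespace; trim() avoids a leading empty token.
        for (String token : normalized.trim().split("\\s+")) {
            if (!token.isEmpty()) { // an empty input still yields one empty token
                bag.add(token);
            }
        }
        return bag;
    }

    public static void main(String[] args) {
        String sentence = "I've been \"DMX world center\" for long time ago.Are u?";
        System.out.println(distinctWords(sentence));
    }
}
```

Whether to drop punctuation entirely depends on the task. For spam/ham classification, characters such as "!" can carry signal, so keeping them as separate tokens is also a reasonable choice.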

Upvotes: 2
