Reputation: 5712
I'm writing a bag-of-words module from scratch, and I'm not sure whether it is best practice to remove punctuation in this approach. Consider the sentence:
I've been "DMX world center" for long time ago.Are u?
Question: for the bag of words, should I consider
DMX (no quotation mark) or "DMX (which includes the left quotation mark)?
u (without the question mark) or u? (with the question mark)?
In short, should I remove all punctuation marks when collecting distinct words?
Thanks in advance
Update: this is the code I have implemented so far.
Sample text : ham , im .. On the snowboarding trip. I was wondering if your planning to get everyone together befor we go..a meet and greet kind of affair? Cheers,
HashSet<String> bagOfWords = new HashSet<String>();
BufferedReader reader = new BufferedReader(new FileReader(path));
while (reader.ready()) {
    // Keep only the 2nd field; the 1st indicates whether the message is spam or ham
    String msg = reader.readLine().split("\t", 2)[1].toLowerCase();
    // This is the regex that I've used to split words
    String[] words = msg.split("[\\s+\n.\t!?+,]");
    for (String word : words) {
        bagOfWords.add(word);
    }
}
Upvotes: 1
Views: 912
Reputation: 939
Try replacing your loop with the following:
while (reader.ready()) {
    // Keep only the 2nd field; the 1st indicates whether the message is spam or ham
    String msg = reader.readLine().split("\t", 2)[1].toLowerCase();
    // Same split regex as in the question
    String[] words = msg.split("[\\s+\n.\t!?+,]");
    for (String word : words) {
        // Strip the remaining special characters; the range !-+ covers ! " # $ % & ' ( ) * +
        String cleaned = word.replaceAll("[!-+.^:,\"?]", "").trim();
        if (!cleaned.isEmpty()) { // skip the empty strings that split() produces
            bagOfWords.add(cleaned);
        }
    }
}
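As a self-contained illustration of the same idea, here is a minimal sketch that tokenizes one message into distinct words with punctuation removed. The class name BagOfWordsSketch and the helper tokenize are hypothetical names chosen for this example, not part of the original code, and the decision to keep inner apostrophes (so "i've" stays one word) is an assumption you may want to change for your data.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

final class BagOfWordsSketch {

    // Hypothetical helper (not from the original post): returns the distinct,
    // lower-cased words of one message with punctuation removed.
    static Set<String> tokenize(String message) {
        Set<String> bag = new LinkedHashSet<>();
        // Split on every run of characters that is not a letter, a digit,
        // or an apostrophe, so "ago.Are" yields two words.
        String[] parts = message.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}']+");
        for (String raw : parts) {
            // Assumption: inner apostrophes are kept, edge apostrophes are dropped.
            String token = raw.replaceAll("^'+|'+$", "");
            if (!token.isEmpty()) { // split() can yield empty strings
                bag.add(token);
            }
        }
        return bag;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("I've been \"DMX world center\" for long time ago.Are u?"));
    }
}
```

With the sentence from the question, this yields dmx and u without the quotation mark or question mark. Whether that is the right choice depends on the task: for spam filtering, punctuation such as repeated exclamation marks can itself be a useful signal, so you might prefer to keep it as separate tokens instead of discarding it.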
Upvotes: 2