Reputation: 26870
Let's say I have this list of words:
String[] stopWords = new String[]{"i","a","and","about","an","are","as","at","be","by","com","for","from","how","in","is","it","not","of","on","or","that","the","this","to","was","what","when","where","who","will","with","the","www"};
Then I have this text:
String text = "I would like to do a nice novel about nature AND people";
Is there a method out there that matches the stop words and removes them while ignoring case, something like this?
String noStopWordsText = remove(text, stopWords);
Result:
" would like do nice novel nature people"
If you know of a regex that would work, great, but I would really prefer something like a commons solution that is a bit more performance oriented.
BTW, right now I'm using this commons method, which lacks proper case-insensitive handling:
private static final String[] stopWords = new String[]{"i", "a", "and", "about", "an", "are", "as", "at", "be", "by", "com", "for", "from", "how", "in", "is", "it", "not", "of", "on", "or", "that", "the", "this", "to", "was", "what", "when", "where", "who", "will", "with", "the", "www", "I", "A", "AND", "ABOUT", "AN", "ARE", "AS", "AT", "BE", "BY", "COM", "FOR", "FROM", "HOW", "IN", "IS", "IT", "NOT", "OF", "ON", "OR", "THAT", "THE", "THIS", "TO", "WAS", "WHAT", "WHEN", "WHERE", "WHO", "WILL", "WITH", "THE", "WWW"};
private static final String[] blanksForStopWords = new String[]{"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""};
noStopWordsText = StringUtils.replaceEach(text, stopWords, blanksForStopWords);
Upvotes: 7
Views: 15432
Reputation: 132972
This is a solution that does not use regular expressions. I think it's inferior to my other answer because it is much longer and less clear, but if performance is really, really important then this is O(n) where n is the length of the text.
import java.util.HashSet;
import java.util.Set;

Set<String> stopWords = new HashSet<String>();
stopWords.add("a");
stopWords.add("and");
// and so on ...
String sampleText = "I would like to do a nice novel about nature AND people";
StringBuilder clean = new StringBuilder();
int index = 0;
while (index < sampleText.length()) {
    // the only word delimiter supported is space, if you want other
    // delimiters you have to do a series of indexOf calls and see which
    // one gives the smallest index, or use regex
    int nextIndex = sampleText.indexOf(" ", index);
    if (nextIndex == -1) {
        // no more spaces: the last word runs to the end of the text
        nextIndex = sampleText.length();
    }
    String word = sampleText.substring(index, nextIndex);
    if (!stopWords.contains(word.toLowerCase())) {
        clean.append(word);
        if (nextIndex < sampleText.length()) {
            // this adds the word delimiter, e.g. the following space
            clean.append(sampleText.charAt(nextIndex));
        }
    }
    index = nextIndex + 1;
}
System.out.println("Stop words removed: " + clean.toString());
Upvotes: 4
Reputation: 41127
Split the text on whitespace. Then loop through the array and append each word to a StringBuilder only if it is not one of the stop words.
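A minimal sketch of that split-and-filter idea, assuming the stop words are kept in lower case in a HashSet. The class name StopWordSplit and the helper remove are made up for illustration (the latter borrows the name from the call in the question), and the stop word list in main is shortened.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class StopWordSplit {
    // Hypothetical helper, named after the remove(text, stopWords) call in the question.
    // Assumes every entry in stopWords is already lower case.
    static String remove(String text, Set<String> stopWords) {
        StringBuilder clean = new StringBuilder();
        // split on runs of whitespace; trim() avoids a leading empty token
        for (String word : text.trim().split("\\s+")) {
            if (stopWords.contains(word.toLowerCase(Locale.ROOT))) {
                continue; // drop stop words, ignoring case
            }
            if (clean.length() > 0) {
                clean.append(' '); // single space between kept words
            }
            clean.append(word);
        }
        return clean.toString();
    }

    public static void main(String[] args) {
        // shortened stop word list, for illustration only
        Set<String> stopWords = new HashSet<>(Arrays.asList("i", "a", "and", "about", "to"));
        String text = "I would like to do a nice novel about nature AND people";
        System.out.println(remove(text, stopWords));
    }
}
```

Note that this collapses any run of whitespace into a single space, so it does not preserve the original spacing of the text.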
Upvotes: 1
Reputation: 240996
You can build a regular expression that matches all the stop words (for example "a ", note the trailing space) and end up with
str.replaceAll(regexpression,"");
OR
String[] stopWords = new String[]{" i ", " a ", " and ", " about ", " an ", " are ", " as ", " at ", " be ", " by ", " com ", " for ", " from ", " how ", " in ", " is ", " it ", " not ", " of ", " on ", " or ", " that ", " the ", " this ", " to ", " was ", " what ", " when ", " where ", " who ", " will ", " with ", " the ", " www "};
String text = " I would like to do a nice novel about nature AND people ";
for (String stopword : stopWords) {
text = text.replaceAll("(?i)"+stopword, " ");
}
System.out.println(text);
output:
would like do nice novel nature people
There might be a better way.
Upvotes: 5
Reputation: 132972
Create a regular expression with your stop words, make it case insensitive, and then use the matcher's replaceAll method to replace all matches with an empty string:
import java.util.regex.*;
Pattern stopWords = Pattern.compile("\\b(?:i|a|and|about|an|are|...)\\b\\s*", Pattern.CASE_INSENSITIVE);
Matcher matcher = stopWords.matcher("I would like to do a nice novel about nature AND people");
String clean = matcher.replaceAll("");
The ... in the pattern is just me being lazy; continue the list of stop words there.
Another method is to loop over all the stop words and use String's replaceAll method. The problem with that approach is that replaceAll compiles a new regular expression on each call, so it's not very efficient to use in loops. Also, String's replaceAll takes no flags argument, so the only way to make the match case insensitive there is to embed (?i) in the pattern itself.
Edit: I added \b around the pattern to make it match whole words only. I also added \s* to make it swallow any spaces after the word; that may not be necessary.
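If the stop words already live in an array, the pattern does not have to be typed out by hand. The following is only a sketch of one way to build it once and reuse it. The class name StopWordRegex and the helpers compile and remove are hypothetical names, and the stop word list in main is shortened.

```java
import java.util.regex.Pattern;

public class StopWordRegex {
    // Hypothetical helper: builds one case-insensitive pattern from a word list.
    static Pattern compile(String[] stopWords) {
        StringBuilder alternation = new StringBuilder();
        for (String word : stopWords) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            // Pattern.quote guards against regex metacharacters in a word
            alternation.append(Pattern.quote(word));
        }
        // \b keeps matches to whole words, \s* swallows the spaces that follow
        return Pattern.compile("\\b(?:" + alternation + ")\\b\\s*", Pattern.CASE_INSENSITIVE);
    }

    // Hypothetical helper: removes every match of the precompiled pattern.
    static String remove(String text, Pattern stopWords) {
        return stopWords.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // shortened stop word list, for illustration only
        Pattern stopWords = compile(new String[]{"i", "a", "and", "about", "to"});
        String text = "I would like to do a nice novel about nature AND people";
        System.out.println(remove(text, stopWords));
    }
}
```

Compiling the pattern once and keeping it in a static field avoids the per-call compilation cost mentioned above.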
Upvotes: 17