Reputation: 777
I have a list of about 30 stop words and a set of articles.
I want to parse each article and remove the stop words from it.
I am not sure what the most efficient way to do this is.
For instance, I could loop through the stop list and replace each word with whitespace wherever it occurs in the article, but that does not seem like a good approach.
Thanks
Upvotes: 3
Views: 2345
Reputation: 26876
Replacing the words in place will be inefficient, because String is immutable and every replacement builds a new copy of the article. Your best bet is probably to parse the article word by word and copy each word to a new StringBuffer, unless it is a stop word, in which case you copy whatever you want in its place. StringBuffer is much more efficient than String here.
How you store the stopwords is probably unimportant if there are only thirty or so. A Set is probably a good bet.
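A minimal sketch of that approach, under a few assumptions: the class name, method name and the short stop list are made up for illustration, words are split on whitespace only (so "the," with a trailing comma would not be matched), and stop words are dropped rather than replaced with something else. It uses StringBuilder, the unsynchronized sibling of StringBuffer, which is usually sufficient when only one thread builds the result.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class StopWordFilter {
    // Hypothetical stop list; the real one would hold the ~30 words.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of", "and"));

    // Copies every word that is not a stop word into a new buffer.
    static String removeStopWords(String article) {
        StringBuilder out = new StringBuilder(article.length());
        for (String word : article.split("\\s+")) {
            // split() yields an empty first token if the text starts with whitespace
            if (word.isEmpty() || STOP_WORDS.contains(word.toLowerCase())) {
                continue;
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(word);
        }
        return out.toString();
    }
}
```

Calling `StopWordFilter.removeStopWords("The quick fox and the lazy dog")` should give `quick fox lazy dog`. Each word costs one hash lookup, so the work grows with the length of the article rather than with article length times the number of stop words.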
Upvotes: 1
Reputation: 490408
Read a word from the input, and copy it to your StringBuilder (or wherever you're putting the result) if and only if it's not in the list of stop words. You'll be able to search for them faster if you put the stop words into something like a HashTable.
Edit: oops, don't know what I was thinking, but you want a set, not a HashTable (or any other Dictionary).
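If the article comes from a file or another stream rather than a String, the same idea can be sketched with java.util.Scanner, which hands back one whitespace-delimited token at a time. The class and method names here are hypothetical, and the comparison is case-sensitive, so the set is assumed to hold the words in the same case as the text.

```java
import java.io.Reader;
import java.util.Scanner;
import java.util.Set;

class StreamingStopWordFilter {
    // Reads tokens one at a time and keeps only those not in the stop set.
    static String filter(Reader input, Set<String> stopWords) {
        StringBuilder out = new StringBuilder();
        try (Scanner scanner = new Scanner(input)) {
            while (scanner.hasNext()) {
                String word = scanner.next();
                if (!stopWords.contains(word)) {
                    if (out.length() > 0) {
                        out.append(' ');
                    }
                    out.append(word);
                }
            }
        }
        return out.toString();
    }
}
```

A set fits better than a map here because the only question asked is whether a word is present; there is no value to associate with it.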
Upvotes: 0
Reputation: 30623
According to the Sun Java Tutorials, you can use the Perl-compatible \b word-boundary matcher in your regular expressions. If you surround the word with \b on both sides, it will match only that whole word, whether it is preceded or followed by punctuation or whitespace.
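As a rough sketch of the regex route, the stop words can be joined into one alternation wrapped in \b on both sides, so the article is scanned once instead of once per word. The class and method names are hypothetical; Pattern.quote is used in case a stop word contains a regex metacharacter, and case-insensitive matching is an assumption about what is wanted.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class RegexStopWords {
    // Builds a pattern like \b(?:the|and|of)\b from the stop list.
    static Pattern compile(List<String> stopWords) {
        String alternation = stopWords.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile("\\b(?:" + alternation + ")\\b",
                Pattern.CASE_INSENSITIVE);
    }

    // Deletes every whole-word match; surrounding whitespace is left in place.
    static String strip(Pattern pattern, String article) {
        return pattern.matcher(article).replaceAll("");
    }
}
```

Because of the boundaries, "the" is removed but "theory" is left alone. The removed words leave their neighbouring spaces behind, so a follow-up such as `replaceAll("\\s+", " ").trim()` may be needed to tidy the result. Compile the pattern once and reuse it for every article.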
Upvotes: 0
Reputation: 346407
java.util.Set
Upvotes: 4