Elham

Reputation: 777

Remove stop words in Java

I have a list of around 30 stop words and a set of articles.

I want to parse each article and remove those stop words from it.

I am not sure what the most efficient way to do this is.

For instance, I could loop through the stop list and replace each word with whitespace if it exists in the article, but that does not seem like a good approach.

Thanks

Upvotes: 3

Views: 2345

Answers (4)

DJClayworth

Reputation: 26876

Replacing the words will be inefficient. Your best bet is probably to parse the article word by word and copy each word to a new StringBuffer, unless it is a stop word, in which case you copy whatever you want in its place. StringBuffer is much more efficient than String here, because a String is immutable and every replacement or concatenation creates a new copy.

How you store the stop words is probably unimportant if there are only thirty or so. A Set is probably a good bet.
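As a rough sketch of the word-by-word copy described above: the class and method names below (StopWordCopier, replaceStopWords) are made up for illustration, a "word" is assumed to be a run of letters or digits, and matching is assumed to be case-insensitive against a lower-case stop list. It uses StringBuilder, the unsynchronized counterpart of StringBuffer, which is usually enough when only one thread touches the buffer.

```java
import java.util.Locale;
import java.util.Set;

final class StopWordCopier {
    // Hypothetical helper: copies the article into a StringBuilder one token
    // at a time, writing `replacement` wherever a stop word occurs.
    // Assumes stopWords holds lower-case entries.
    static String replaceStopWords(String article, Set<String> stopWords, String replacement) {
        StringBuilder out = new StringBuilder(article.length());
        int i = 0;
        while (i < article.length()) {
            int start = i;
            if (Character.isLetterOrDigit(article.charAt(i))) {
                // Consume one word (a run of letters or digits).
                while (i < article.length() && Character.isLetterOrDigit(article.charAt(i))) {
                    i++;
                }
                String word = article.substring(start, i);
                boolean isStopWord = stopWords.contains(word.toLowerCase(Locale.ROOT));
                out.append(isStopWord ? replacement : word);
            } else {
                // Copy whitespace and punctuation through unchanged.
                while (i < article.length() && !Character.isLetterOrDigit(article.charAt(i))) {
                    i++;
                }
                out.append(article, start, i);
            }
        }
        return out.toString();
    }
}
```

Passing an empty replacement removes the stop words but leaves the whitespace around them in place, so a later pass would be needed if runs of spaces matter.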

Upvotes: 1

Jerry Coffin

Reputation: 490408

Read a word from the input, and copy it to your StringBuilder (or wherever you're putting the result) if and only if it's not in the list of stop words. You'll be able to search for them faster if you put the stop words into something like a Hashtable.

Edit: oops, don't know what I was thinking, but you want a Set, not a Hashtable (or any other Dictionary).
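One way this might look when the article arrives as a stream rather than a String is sketched below. The names StopWordScanner, STOP_WORDS and filter are invented for the example, the five stop words are placeholders, and stripping punctuation from the ends of a token before the lookup is an assumption about what counts as a word.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Scanner;
import java.util.Set;

final class StopWordScanner {
    // Placeholder stop list; a real one would hold the ~30 words from the question.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of", "and"));

    // Hypothetical helper: reads whitespace-separated tokens from the input and
    // appends each one to the result unless it is a stop word.
    static String filter(Readable input) {
        StringBuilder out = new StringBuilder();
        try (Scanner scanner = new Scanner(input)) {
            while (scanner.hasNext()) {
                String token = scanner.next();
                // Assumption: ignore leading/trailing punctuation and case for the lookup.
                String key = token
                        .replaceAll("^[^\\p{L}\\p{N}]+|[^\\p{L}\\p{N}]+$", "")
                        .toLowerCase(Locale.ROOT);
                if (!STOP_WORDS.contains(key)) {
                    if (out.length() > 0) {
                        out.append(' ');
                    }
                    out.append(token);
                }
            }
        }
        return out.toString();
    }
}
```

A HashSet lookup is constant time on average, although with only about 30 entries the choice of collection is unlikely to be the bottleneck.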

Upvotes: 0

amphetamachine

Reputation: 30623

According to the Sun Java Tutorials, you can use the Perl-compatible \b word-boundary matcher in your regular expressions. If you surround a word with \b on both sides, the pattern will match only that whole word, whether it is preceded or followed by a punctuation character or whitespace.
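A minimal sketch of the \b idea, assuming all stop words are joined into a single alternation and matched case-insensitively. StopWordRegex, compile and remove are illustrative names, not part of any library.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class StopWordRegex {
    // Hypothetical helper: builds one pattern of the form \b(?:w1|w2|...)\b.
    // Pattern.quote keeps any regex metacharacters in a stop word literal.
    static Pattern compile(List<String> stopWords) {
        String alternation = stopWords.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile("\\b(?:" + alternation + ")\\b", Pattern.CASE_INSENSITIVE);
    }

    // Deletes every whole-word match; surrounding whitespace is left as it was.
    static String remove(String article, Pattern stopWordPattern) {
        return stopWordPattern.matcher(article).replaceAll("");
    }
}
```

Because of the boundaries, a stop word such as "the" does not match inside "then". Compiling the pattern once and reusing it for every article avoids recompiling it per call.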

Upvotes: 0

Michael Borgwardt

Reputation: 346407

  • Put stop words into a java.util.Set
  • Split input into words
  • For each word in the input, check whether it is contained in the set of stop words, and write it to the output if not
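The three steps above could be sketched roughly as follows. StopWordSplitter and removeStopWords are made-up names, words are assumed to be separated by whitespace only, and the set is assumed to hold lower-case entries.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

final class StopWordSplitter {
    // Hypothetical helper: splits on whitespace, drops tokens found in the
    // stop-word set, and joins the remaining tokens with single spaces.
    static String removeStopWords(String article, Set<String> stopWords) {
        return Arrays.stream(article.trim().split("\\s+"))
                .filter(word -> !stopWords.contains(word.toLowerCase(Locale.ROOT)))
                .collect(Collectors.joining(" "));
    }
}
```

This version is short but treats punctuation as part of the token, so "the," would not be recognised as a stop word. Whether that matters depends on how the articles are written.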

Upvotes: 4
