Reputation: 569
Hey, I am doing a project in which I have to remove stopwords (or rather certain words; I have a list of about 560 of them) from tweets. I was using the code below:
tweet= tweet.replaceAll(' '+stopword+' ', "");
But there is a problem: the first word can also be a stopword. How do I handle the case where the first word of the tweet is a stopword? If you are thinking of
text = text.replaceAll(stopword+' ', "");
then this won't work, because some stopwords also appear as the ending characters of a longer word. Please suggest a solution for this. Thanks in advance.
Upvotes: 0
Views: 109
Reputation: 140494
Use the word boundary matcher:
"\\b" + Pattern.quote(stopword) + "\\b"
This matches word boundaries, i.e. positions between a word character and a non-word character, which covers spaces, start/end of string, punctuation etc. See the doc for java.util.regex.Pattern for more details.
I also quoted the stopword with Pattern.quote, since it looks like a variable and thus shouldn't be trusted to contain a valid regex.
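A minimal sketch of how this could be applied to the whole list. The class name StopwordRemover, the method removeStopwords and the sample stopwords are made up for illustration, and the whitespace clean-up at the end is an extra step not discussed above. Note that the match is case-sensitive as written.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class StopwordRemover {

    // Removes each stopword where it appears as a whole word,
    // then collapses the leftover runs of whitespace.
    static String removeStopwords(String tweet, List<String> stopwords) {
        String result = tweet;
        for (String stopword : stopwords) {
            // \b on both sides: the stopword must not be part of a longer word.
            // Pattern.quote: treat the stopword as literal text, not as a regex.
            String regex = "\\b" + Pattern.quote(stopword) + "\\b";
            result = result.replaceAll(regex, "");
        }
        // Assumption: extra spaces left behind by removed words are unwanted.
        return result.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Sample list; the real one would hold the ~560 stopwords.
        List<String> stopwords = List.of("the", "is", "at");
        // "the" at the start is removed, but "cat" and "theatre" are left intact.
        System.out.println(removeStopwords("the cat is at the theatre", stopwords));
        // prints "cat theatre"
    }
}
```

With about 560 stopwords this runs one replaceAll per stopword per tweet. If that turns out to be slow, the quoted stopwords could probably be joined with | into a single precompiled Pattern instead.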
Upvotes: 3