Reputation: 73
I have a string with lots of words, and a text file containing stopwords that I need to remove from the string. Let's say I have the string
s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs."
After removing the stopwords, the string should look like this:
"love phone, super fast much cool jelly bean....but recently bugs."
I have been able to achieve this, but the problem I am facing is that whenever there are adjacent stopwords in the string, only the first one is removed, and I get this result:
"love phone, super fast there's much and cool with jelly bean....but recently seen bugs"
Here's my stopwordslist.txt file: Stopwords
How can I solve this problem? Here's what I have done so far:
int k=0,i,j;
ArrayList<String> wordsList = new ArrayList<String>();
String sCurrentLine;
String[] stopwords = new String[2000];
try{
FileReader fr=new FileReader("F:\\stopwordslist.txt");
BufferedReader br= new BufferedReader(fr);
while ((sCurrentLine = br.readLine()) != null){
stopwords[k]=sCurrentLine;
k++;
}
String s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
StringBuilder builder = new StringBuilder(s);
String[] words = builder.toString().split("\\s");
for (String word : words){
wordsList.add(word);
}
for(int ii = 0; ii < wordsList.size(); ii++){
for(int jj = 0; jj < k; jj++){
if(stopwords[jj].contains(wordsList.get(ii).toLowerCase())){
wordsList.remove(ii);
break;
}
}
}
for (String str : wordsList){
System.out.print(str+" ");
}
}catch(Exception ex){
System.out.println(ex);
}
Upvotes: 7
Views: 33184
Reputation: 1
// Fills the given list with a small sample of stop words.
private static void myStopWords(List<String> stopWordCollection) {
    stopWordCollection.add("a");
    stopWordCollection.add("and");
    stopWordCollection.add("is");
    stopWordCollection.add("the");
    stopWordCollection.add("are");
    stopWordCollection.add("of");
    stopWordCollection.add("in");
    stopWordCollection.add("for");
    stopWordCollection.add("where");
    stopWordCollection.add("when");
}
// Returns the words from list that are not in stopWordCollection.
private static List<String> myStopWordRemoval(List<String> list, List<String> stopWordCollection) {
    List<String> list2 = new ArrayList<String>();
    for (int i = 0; i < list.size(); i++) {
        // keep the word only if it is not a stop word
        if (!stopWordCollection.contains(list.get(i).toLowerCase()))
            list2.add(list.get(i));
    }
    return list2;
}
Upvotes: 0
Reputation: 290
Recently one of my projects required filtering stop words, stemming, and swear words from a given text or file. After going through a few blogs and write-ups, I created a simple library to filter data or files and made it available in Maven. I hope this helps someone.
https://github.com/uttesh/exude
<dependency>
<groupId>com.uttesh</groupId>
<artifactId>exude</artifactId>
<version>0.0.2</version>
</dependency>
Upvotes: 0
Reputation: 1
It seems that you stop once one stop word has been removed and then move on to the next word; you need to remove all stop words in the sentence.
You should try changing your code. The original loop is shown first, followed by the modified version:
for(int ii = 0; ii < wordsList.size(); ii++){
for(int jj = 0; jj < k; jj++){
if(stopwords[jj].contains(wordsList.get(ii).toLowerCase())){
wordsList.remove(ii);
break;
}
}
}
for(int ii = 0; ii < wordsList.size(); ii++)
{
for(int jj = 0; jj < k; jj++)
{
if(wordsList.get(ii).toLowerCase().contains(stopwords[jj]))
{
wordsList.remove(ii);
}
}
}
Note that break is removed and stopword.contains(word) is changed to word.contains(stopword).
Upvotes: 0
Reputation: 95
You can use the replaceAll function like this:
String yourString ="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
yourString=yourString.replaceAll("stop" ,"");
Upvotes: 3
Reputation: 101
Try storing the stopwords in a Set, and then tokenise your string into a List. Afterwards you can simply use 'removeAll' to get the result.
Set<String> stopwords = new HashSet<>();
//fill in the set with your file
String s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
// Arrays.asList alone returns a fixed-size list, so copy it into an ArrayList before removing
List<String> listOfStrings = new ArrayList<>(Arrays.asList(s.split(" ")));
listOfStrings.removeAll(stopwords);
// StringUtils is from Apache Commons Lang; String.join(" ", listOfStrings) also works on Java 8+
String result = StringUtils.join(listOfStrings, " ");
No for loops needed - they usually mean problems.
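To fill in the "fill in the set with your file" step, here is a rough sketch of how it could look. The class and method names (StopwordSetDemo, toStopwordSet, loadStopwords, filter) are made up for illustration, and it assumes the file has one stopword per line. It uses removeIf instead of removeAll so the comparison can ignore case, which is my own variation, not part of the answer above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StopwordSetDemo {
    // Normalises raw lines (assumed: one stopword per line) into a lower-case set.
    static Set<String> toStopwordSet(List<String> lines) {
        Set<String> set = new HashSet<>();
        lines.forEach(line -> {
            String word = line.trim().toLowerCase();
            if (!word.isEmpty()) {
                set.add(word);
            }
        });
        return set;
    }

    // Hypothetical helper: reads the stopword file and builds the set.
    static Set<String> loadStopwords(Path file) {
        try {
            return toStopwordSet(Files.readAllLines(file));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Splits on whitespace and drops every token that is a stopword, ignoring case.
    static String filter(String text, Set<String> stopwords) {
        List<String> tokens = new ArrayList<>(Arrays.asList(text.split("\\s+")));
        tokens.removeIf(token -> stopwords.contains(token.toLowerCase()));
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        Set<String> stopwords = toStopwordSet(Arrays.asList("i", "this", " ", "And"));
        System.out.println(filter("I love this phone and more", stopwords));
    }
}
```

Tokens with attached punctuation, such as "phone," or "bugs.", are compared as-is, so a stopword followed by a comma would not be removed by this sketch.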
Upvotes: 1
Reputation: 7341
This is a much more elegant solution (IMHO), using only regular expressions:
// instead of the ".....", add all your stopwords, separated by "|"
// "\\b" is to account for word boundaries, i.e. not replace "his" in "this"
// the "\\s?" is to suppress optional trailing white space
Pattern p = Pattern.compile("\\b(I|this|its.....)\\b\\s?");
Matcher m = p.matcher("I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.");
String s = m.replaceAll("");
System.out.println(s);
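If the stopwords come from a list rather than being typed into the pattern by hand, the alternation can be assembled in code. This is only a sketch: RegexStopWords, buildPattern and strip are names I made up, and the Pattern.quote and CASE_INSENSITIVE parts are my additions, not something the pattern above does.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class RegexStopWords {
    // Joins the stopwords with "|" and wraps them in word boundaries, as in the pattern above.
    static Pattern buildPattern(List<String> stopwords) {
        String alternation = stopwords.stream()
                .map(Pattern::quote) // escape regex metacharacters inside a stopword
                .collect(Collectors.joining("|"));
        return Pattern.compile("\\b(" + alternation + ")\\b\\s?", Pattern.CASE_INSENSITIVE);
    }

    // Removes every stopword match, plus one optional trailing whitespace character.
    static String strip(String text, List<String> stopwords) {
        return buildPattern(stopwords).matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("I love this phone, its super fast", Arrays.asList("i", "this", "its")));
    }
}
```

One thing to watch: "\\b" treats an apostrophe as a word boundary, so a stopword "I" would also match the "I" in "I've" and leave "'ve" behind.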
Upvotes: 5
Reputation: 20065
The error occurs because you remove elements from the list while iterating over it by index.
Let's say wordsList contains |word0|word1|word2|. If ii is equal to 1 and the if test is true, you call wordsList.remove(1);. After that, your list is |word0|word2|. ii is then incremented to 2, which is now equal to the size of the list, so the loop ends and word2 is never tested.
There are several solutions. For example, instead of removing values you can set them to "". Or you can build a separate "result" list.
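To make the skipping visible, here is a small sketch that runs the index-based removal next to one possible fix (removing through the Iterator). SkipDemo and its method names are invented for the example, and it assumes an exact, case-insensitive match against a Set rather than the contains test from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

class SkipDemo {
    // Buggy: after remove(ii) the next word slides into position ii, and ii++ then skips it.
    static List<String> removeSkipping(List<String> words, Set<String> stopwords) {
        List<String> copy = new ArrayList<>(words);
        for (int ii = 0; ii < copy.size(); ii++) {
            if (stopwords.contains(copy.get(ii).toLowerCase())) {
                copy.remove(ii);
            }
        }
        return copy;
    }

    // One possible fix: Iterator.remove keeps the iteration position consistent.
    static List<String> removeWithIterator(List<String> words, Set<String> stopwords) {
        List<String> copy = new ArrayList<>(words);
        for (Iterator<String> it = copy.iterator(); it.hasNext();) {
            if (stopwords.contains(it.next().toLowerCase())) {
                it.remove();
            }
        }
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("fast", "and", "there's", "so", "much");
        Set<String> stopwords = new HashSet<>(Arrays.asList("and", "there's", "so"));
        System.out.println(removeSkipping(words, stopwords));     // [fast, there's, much]
        System.out.println(removeWithIterator(words, stopwords)); // [fast, much]
    }
}
```

In the first method "there's" survives because it moved into index 1 right after "and" was removed, which is the same effect as the adjacent stopwords in the question.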
Upvotes: 3
Reputation: 1925
Try the program below.
String s="I love this phone, its super fast and there's so" +
" much new and cool things with jelly bean....but of recently I've seen some bugs.";
String[] words = s.split(" ");
ArrayList<String> wordsList = new ArrayList<String>();
Set<String> stopWordsSet = new HashSet<String>();
stopWordsSet.add("I");
stopWordsSet.add("THIS");
stopWordsSet.add("AND");
stopWordsSet.add("THERE'S");
for(String word : words)
{
String wordCompare = word.toUpperCase();
if(!stopWordsSet.contains(wordCompare))
{
wordsList.add(word);
}
}
for (String str : wordsList){
System.out.print(str+" ");
}
OUTPUT: love phone, its super fast so much new cool things with jelly bean....but of recently I've seen some bugs.
Upvotes: 4
Reputation: 5868
Try it the following way:
String s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
String stopWords[]={"love","this","cool"};
for(int i=0;i<stopWords.length;i++){
if(s.contains(stopWords[i])){
s=s.replaceAll(stopWords[i]+"\\s+", ""); //note this will remove spaces at the end
}
}
System.out.println(s);
This way your final output will not contain the words you don't want. Just put your stop words in an array and replace them in the target string.
Output for my stopwords:
I phone, its super fast and there's so much new and things with jelly bean....but of recently I've seen some bugs.
Upvotes: 1
Reputation: 10497
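Why don't you use the approach below instead? It is easier to read and understand:
for(String word : words){ // words: the stop words to remove
s = s.replaceAll(word+"\\s*", ""); // replaceAll, because replace would treat "\\s*" as literal text
}
System.out.println(s); // prints the string with the words removed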
Upvotes: 1
Reputation: 37073
Try using the replaceAll method of String, like:
String myString = "I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
String stopWords = "I|its|with|but";
String afterStopWords = myString.replaceAll("(" + stopWords + ")\\s*", "");
System.out.println(afterStopWords);
OUTPUT:
love this phone, super fast and there's so much new and cool things jelly bean....of recently 've seen some bugs.
Upvotes: 1