JavaLearner

Reputation: 73

Removing stopwords from a String in Java

I have a string with lots of words and a text file containing stopwords that I need to remove from that string. Let's say I have the String

s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs."

After removing the stopwords, the string should look like:

"love phone, super fast much cool jelly bean....but recently bugs."

I have been able to achieve this, but the problem is that whenever there are adjacent stopwords in the String, only the first one is removed, and I get this result:

"love phone, super fast there's much and cool with jelly bean....but recently seen bugs"  

Here's my stopwordslist.txt file : Stopwords

How can I solve this problem? Here's what I have done so far:

int k=0,i,j;
ArrayList<String> wordsList = new ArrayList<String>();
String sCurrentLine;
String[] stopwords = new String[2000];
try{
        FileReader fr=new FileReader("F:\\stopwordslist.txt");
        BufferedReader br= new BufferedReader(fr);
        while ((sCurrentLine = br.readLine()) != null){
            stopwords[k]=sCurrentLine;
            k++;
        }
        String s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
        StringBuilder builder = new StringBuilder(s);
        String[] words = builder.toString().split("\\s");
        for (String word : words){
            wordsList.add(word);
        }
        for(int ii = 0; ii < wordsList.size(); ii++){
            for(int jj = 0; jj < k; jj++){
                if(stopwords[jj].contains(wordsList.get(ii).toLowerCase())){
                    wordsList.remove(ii);
                    break;
                }
             }
        }
        for (String str : wordsList){
            System.out.print(str+" ");
        }   
    }catch(Exception ex){
        System.out.println(ex);
    }

Upvotes: 7

Views: 33184

Answers (11)

user20405982

Reputation: 1

private static void myStopWords(List<String> stopWordCollection) {
    stopWordCollection.add("a");
    stopWordCollection.add("and");
    stopWordCollection.add("is");
    stopWordCollection.add("the");
    stopWordCollection.add("are");
    stopWordCollection.add("of");
    stopWordCollection.add("in");
    stopWordCollection.add("for");
    stopWordCollection.add("where");
    stopWordCollection.add("when");
}

private static List<String> myStopWordRemoval(List<String> list, List<String> stopWordCollection) {
    // Keep only the words that are NOT in the stop word collection.
    List<String> list2 = new ArrayList<String>();
    for (int i = 0; i < list.size(); i++) {
        if (!stopWordCollection.contains(list.get(i))) {
            list2.add(list.get(i));
        }
    }
    return list2;
}

Upvotes: 0

Uttesh Kumar

Reputation: 290

Recently one of my projects required filtering stop/stem words and swear words from a given text or file. After going through a few blogs and write-ups, I created a simple library to filter data/files and made it available on Maven. I hope this helps someone.

https://github.com/uttesh/exude

    <dependency>
        <groupId>com.uttesh</groupId>
        <artifactId>exude</artifactId>
        <version>0.0.2</version>
    </dependency>

Upvotes: 0

Inquisitor

Reputation: 1

It seems that you stop as soon as one stop word has been removed from the sentence and then move on: you need to remove all stop words in each sentence.

You should try changing your code:

From:

for(int ii = 0; ii < wordsList.size(); ii++){
    for(int jj = 0; jj < k; jj++){
        if(stopwords[jj].contains(wordsList.get(ii).toLowerCase())){
            wordsList.remove(ii);
            break;
        }
    }
}

To something like:

for(int ii = 0; ii < wordsList.size(); ii++)
{
    for(int jj = 0; jj < k; jj++)
    {
        if(wordsList.get(ii).toLowerCase().contains(stopwords[jj]))
        {
            wordsList.remove(ii);
        }
    }
}

Note that break is removed and stopword.contains(word) is changed to word.contains(stopword).

Upvotes: 0

Navnath Chinchore

Reputation: 95

You can use the replaceAll function like this:

String yourString ="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
yourString=yourString.replaceAll("stop" ,""); // put the stop word to remove in place of "stop"

Upvotes: 3

Michal Lozinski

Reputation: 101

Try storing the stopwords in a Set, and then tokenise your string into a list. Afterwards you can simply use 'removeAll' to get the result.

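Set<String> stopwords = new HashSet<>();
//fill in the set with your file

String s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
// Arrays.asList returns a fixed-size list, so copy it into an ArrayList to allow removal
List<String> listOfStrings = new ArrayList<>(Arrays.asList(s.split(" ")));

listOfStrings.removeAll(stopwords);
String result = StringUtils.join(listOfStrings, " "); // StringUtils comes from Apache Commons Lang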

No for loops needed - they usually mean problems.
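A self-contained sketch of the set-based idea, assuming Java 8 or later. The names SetStopwords, loadStopwords and removeStopwords are made up for illustration, and it uses removeIf rather than removeAll so that each token can be lower-cased before the lookup:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SetStopwords {

    // Turn the raw lines of a stop word file into a lower-cased set, skipping blank lines.
    static Set<String> loadStopwords(List<String> lines) {
        Set<String> stopwords = new HashSet<>();
        for (String line : lines) {
            String word = line.trim().toLowerCase();
            if (!word.isEmpty()) {
                stopwords.add(word);
            }
        }
        return stopwords;
    }

    // Split on whitespace, drop every token found in the set, and join the rest.
    static String removeStopwords(String text, Set<String> stopwords) {
        // Copy into an ArrayList because Arrays.asList returns a fixed-size list.
        List<String> tokens = new ArrayList<>(Arrays.asList(text.split("\\s+")));
        tokens.removeIf(token -> stopwords.contains(token.toLowerCase()));
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        Set<String> stopwords = loadStopwords(Arrays.asList("I", "this", "its", "and"));
        String s = "I love this phone, its super fast and cool";
        System.out.println(removeStopwords(s, stopwords)); // love phone, super fast cool
    }
}
```

The lines could be read with Files.readAllLines on the stop word file. Tokens with attached punctuation, such as "phone,", are compared as they are, so a stop word followed by a comma would not be removed by this sketch.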

Upvotes: 1

geert3

Reputation: 7341

This is a much more elegant solution (IMHO), using only regular expressions:

    // instead of the ".....", add all your stopwords, separated by "|"
    // "\\b" is to account for word boundaries, i.e. not replace "his" in "this"
    // the "\\s?" is to suppress optional trailing white space
    Pattern p = Pattern.compile("\\b(I|this|its.....)\\b\\s?");
    Matcher m = p.matcher("I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.");
    String s = m.replaceAll("");
    System.out.println(s);
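If the stop words come from a file rather than being typed into the pattern, the alternation can be built at run time. This is only a sketch, assuming Java 8 or later; RegexStopwords, buildPattern and removeStopwords are illustrative names, and the CASE_INSENSITIVE flag is an addition that the pattern above does not use:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class RegexStopwords {

    // Join the stop words with "|"; Pattern.quote escapes any regex metacharacters in them.
    static Pattern buildPattern(List<String> stopwords) {
        String alternation = stopwords.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile("\\b(?:" + alternation + ")\\b\\s?", Pattern.CASE_INSENSITIVE);
    }

    static String removeStopwords(String text, List<String> stopwords) {
        if (stopwords.isEmpty()) {
            return text; // an empty alternation would match at every word boundary
        }
        return buildPattern(stopwords).matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String s = "I love this phone, its super fast";
        System.out.println(removeStopwords(s, Arrays.asList("i", "this", "its")));
        // love phone, super fast
    }
}
```

One caveat: "\\b" treats an apostrophe as a word boundary, so with "I" in the list the word "I've" is reduced to "'ve".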

Upvotes: 5

alain.janinm

Reputation: 20065

The error occurs because you remove elements from the list you are iterating over. Let's say you have a wordsList that contains |word0|word1|word2|. If ii is equal to 1 and the if test is true, then you call wordsList.remove(1);. After that your list is |word0|word2|, so word2 has moved to index 1. ii is then incremented to 2, which is no longer less than the size of the list, so the loop ends and word2 is never tested.

From there, there are several solutions. For example, instead of removing values you can set the value to "", or you can build a separate "result" list.
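Two ways to avoid the skipped element, sketched under the assumption that the stop words are held in a lower-cased Set. The class and method names are illustrative and not taken from the question:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

public class SafeRemoval {

    // Option 1: remove through the Iterator, which keeps its position valid after a removal.
    static List<String> removeWithIterator(List<String> words, Set<String> stopwords) {
        List<String> copy = new ArrayList<>(words);
        Iterator<String> it = copy.iterator();
        while (it.hasNext()) {
            if (stopwords.contains(it.next().toLowerCase())) {
                it.remove();
            }
        }
        return copy;
    }

    // Option 2: leave the input untouched and collect the kept words in a result list.
    static List<String> copyToResult(List<String> words, Set<String> stopwords) {
        List<String> result = new ArrayList<>();
        for (String word : words) {
            if (!stopwords.contains(word.toLowerCase())) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> stopwords = new HashSet<>(Arrays.asList("i", "this", "and", "so"));
        List<String> words = Arrays.asList("I", "love", "this", "and", "so", "fast");
        System.out.println(removeWithIterator(words, stopwords)); // [love, fast]
        System.out.println(copyToResult(words, stopwords));       // [love, fast]
    }
}
```

Both versions handle the adjacent stop words "this", "and", "so" because no index is advanced past an element that has just shifted into the removed slot.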

Upvotes: 3

robin

Reputation: 1925

Try the program below.

    String s="I love this phone, its super fast and there's so" +
            " much new and cool things with jelly bean....but of recently I've seen some bugs.";
    String[] words = s.split(" ");
    ArrayList<String> wordsList = new ArrayList<String>();
    Set<String> stopWordsSet = new HashSet<String>();
    stopWordsSet.add("I");
    stopWordsSet.add("THIS");
    stopWordsSet.add("AND");
    stopWordsSet.add("THERE'S");

    for(String word : words)
    {
        String wordCompare = word.toUpperCase();
        if(!stopWordsSet.contains(wordCompare))
        {
            wordsList.add(word);
        }
    }

    for (String str : wordsList){
        System.out.print(str+" ");
    }

OUTPUT: love phone, its super fast so much new cool things with jelly bean....but of recently I've seen some bugs.

Upvotes: 4

Darshan Lila

Reputation: 5868

Try it the following way:

   String s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
   String stopWords[]={"love","this","cool"};
   for(int i=0;i<stopWords.length;i++){
       if(s.contains(stopWords[i])){
           s=s.replaceAll(stopWords[i]+"\\s+", ""); //note this will remove spaces at the end
       }
   }
   System.out.println(s);

This way your final output will not contain the words you don't want. Just put the stop words in an array and replace each of them in the string.
Output for my stopwords:

I phone, its super fast and there's so much new and things with jelly bean....but of recently I've seen some bugs.

Upvotes: 1

Vimal Bera

Reputation: 10497

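Why not use the approach below instead? It is easier to read and understand:

for(String word : words){ // here words holds the stop words to remove
    // replaceAll takes a regex; replace() would treat "\\s*" as literal text
    s = s.replaceAll("\\b" + Pattern.quote(word) + "\\b\\s*", "");
}
System.out.println(s); // prints the string with the stop words removed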

Upvotes: 1

SMA

Reputation: 37073

Try using the replaceAll API of String, like this:

String myString = "I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
String stopWords = "I|its|with|but";
String afterStopWords = myString.replaceAll("(" + stopWords + ")\\s*", "");
System.out.println(afterStopWords);

OUTPUT: 
love this phone, super fast and there's so much new and cool things jelly bean....of recently 've seen some bugs.

Upvotes: 1
