Shorbhaja

Reputation: 29

Most time efficient way to remove stop words in Java from an array of strings

How do I remove these stop words in the most efficient way? The approach below doesn't remove them. What am I missing?

Is there any other way to do this?

I want to accomplish this in the most time efficient way in Java.

public static HashSet<String> hs = new HashSet<String>();


public static String[] stopwords = {"a", "able", "about",
        "across", "after", "all", "almost", "also", "am", "among", "an",
        "and", "any", "are", "as", "at", "b", "be", "because", "been",
        "but", "by", "c", "can", "cannot", "could", "d", "dear", "did",
        "do", "does", "e", "either", "else", "ever", "every", "f", "for",
        "from", "g", "get", "got", "h", "had", "has", "have", "he", "her",
        "hers", "him", "his", "how", "however", "i", "if", "in", "into",
        "is", "it", "its", "j", "just", "k", "l", "least", "let", "like",
        "likely", "m", "may", "me", "might", "most", "must", "my",
        "neither", "n", "no", "nor", "not", "o", "of", "off", "often",
        "on", "only", "or", "other", "our", "own", "p", "q", "r", "rather",
        "s", "said", "say", "says", "she", "should", "since", "so", "some",
        "t", "than", "that", "the", "their", "them", "then", "there",
        "these", "they", "this", "tis", "to", "too", "twas", "u", "us",
        "v", "w", "wants", "was", "we", "were", "what", "when", "where",
        "which", "while", "who", "whom", "why", "will", "with", "would",
        "x", "y", "yet", "you", "your", "z"};
public StopWords()
{
    int len= stopwords.length;
    for(int i=0;i<len;i++)
    {
        hs.add(stopwords[i]);
    }
    System.out.println(hs);
}

public List<String> removedText(List<String> S)
{
    Iterator<String> text = S.iterator();

    while(text.hasNext())
    {
        String token = text.next();
        if(hs.contains(token))
        {

                S.remove(text.next());
        }
        text = S.iterator();
    }
    return S;
}

Upvotes: 2

Views: 1887

Answers (4)

Martin

Reputation: 1350

I think the most efficient way is to use the binarySearch method with a sorted list of terms:

int indexStop = Collections.binarySearch(EncyclopediaHelper.getStopWords(), string, String::compareToIgnoreCase);

boolean stop = indexStop >= 0; // binarySearch returns a negative value when the key is absent; index 0 is a valid match

More information here: What is the performance of Collections.binarySearch over manually searching a list?
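
To illustrate the lookup above, here is a minimal sketch. The class name SortedStopWords and the shortened word list are made up for the example (EncyclopediaHelper is not shown in this thread), and the list has to be sorted with the same comparator that is passed to binarySearch, otherwise the result is undefined.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of the original answer.
final class SortedStopWords {
    // Abbreviated stop word list, sorted case-insensitively once at class load.
    private static final List<String> SORTED;

    static {
        String[] words = {"the", "a", "and", "of", "is"};
        Arrays.sort(words, String.CASE_INSENSITIVE_ORDER);
        SORTED = Arrays.asList(words);
    }

    static boolean isStopWord(String token) {
        // binarySearch returns the index (>= 0) when the key is found,
        // otherwise (-(insertion point) - 1), which is always negative.
        int index = Collections.binarySearch(SORTED, token, String.CASE_INSENSITIVE_ORDER);
        return index >= 0;
    }
}
```

Note that "a" ends up at index 0 here, so the check must accept 0 as a match. Each lookup is O(log n), whereas a HashSet lookup is O(1) on average, so it is worth measuring both before settling on one.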

Upvotes: 0

ZaoTaoBao

Reputation: 2615

Maybe you can use org.apache.commons.lang.ArrayUtils inside the loop:

stopwords = ArrayUtils.removeElement(stopwords, element);

https://commons.apache.org/proper/commons-lang/javadocs/api-2.6/org/apache/commons/lang/ArrayUtils.html

Upvotes: 0

Mureinik

Reputation: 311883

You shouldn't manipulate the list while iterating over it. Moreover, you're calling next() twice under the same loop that evaluates hasNext(). Instead, you should use the iterator to remove the item:

public static List<String> removedText(List<String> s) {
    Iterator<String> text = s.iterator();

    while (text.hasNext()) {
        String token = text.next();
        if (hs.contains(token)) {
            text.remove();
        }
    }
    return s;
}

But that's a bit of "reinventing the wheel". Instead, you could just use the removeAll(Collection) method:

s.removeAll(hs);
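
A minimal, self-contained sketch of the removeAll approach, in case it helps. The class name StopWordFilter and the shortened stop word set are only for illustration; the method works on a copy so that it also accepts fixed-size or immutable input lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of the original question's class.
final class StopWordFilter {
    // Abbreviated stop word set; the full array from the question would go here.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "the", "is", "of"));

    static List<String> removeStopWords(List<String> tokens) {
        // Copy first: lists from Arrays.asList cannot shrink, and the caller's list stays untouched.
        List<String> result = new ArrayList<>(tokens);
        // ArrayList.removeAll checks each element against the set, so this is a single pass
        // with an average O(1) lookup per token.
        result.removeAll(STOP_WORDS);
        return result;
    }
}
```

For example, removeStopWords(Arrays.asList("the", "cat", "is", "a", "pet")) leaves only "cat" and "pet". The comparison is case-sensitive, so tokens would need to be lower-cased first if the input is mixed case.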

Upvotes: 2

LChukka

Reputation: 11

Try the changes suggested below:

public static List<String> removedText(List<String> S)
{
    Iterator<String> text = S.iterator();

    while(text.hasNext())
    {
        String token = text.next();
        if(hs.contains(token))
        {

                text.remove(); // was S.remove(text.next()); removing through the iterator avoids a ConcurrentModificationException
        }
       // text = S.iterator(); why the need to re-assign?
    }
    return S;
}

Upvotes: -1
