Tim51092
Tim51092

Reputation: 201

is this use of objects redundant and/or inefficient?

I'm fairly inexperienced with using objects so I would really like some input.

I'm trying to remove comments from a list that have certain "unwanted words" in them, both the comments and the list of "unwanted words" are in ArrayList objects.

This is inside of a class called FormHelper, which contains the private member comments as an ArrayList, the auditList ArrayList is created locally in a member function called populateComments(), which then calls this function (below). PopulateComments() is called by the constructor, and so this function only gets called once, when an instance of FormHelper is created.

private void filterComments(ArrayList <String> auditList) {
    for(String badWord : auditList) {
        for (String thisComment : this.comments) {
            if(thisComment.contains(badWord)) {
                int index = this.comments.indexOf(thisComment);
                this.comments.remove(index);
            }
        }
    }
}

something about the way I implemented this doesn't feel right, I'm also concerned that I'm using ArrayList functions inefficiently. Is my suspicion correct?

Upvotes: 0

Views: 142

Answers (2)

Andy Turner
Andy Turner

Reputation: 140299

As a general approach, removing individual elements from an ArrayList in a loop is inefficient, because it requires shifting all of the "following" elements along one position in the array.

A B C D E
  ^        if you remove this
    ^---^ you have to shift these 3 along by one
   / / /
A C D E

If you remove lots of elements, this will have a substantial impact on the time complexity. It's better to identify the elements to remove, and then remove them all at once.

I suggest that a neater way to do this would be using removeIf, which (at least for collection implementations such as ArrayList) does this "all at once" removal:

this.comments.removeIf(
    c -> auditList.stream().anyMatch(c::contains));

This is concise, but probably quite slow because it has to keep checking the entire comment string to see if it contains each bad word.

A probably faster way would be to use regex:

Pattern p = Pattern.compile(
    auditList.stream()
        .map(Pattern::quote)
        .collect(joining("|")));
this.comments.removeIf(
    c -> p.matcher(c).find());

This would be better because the compiled regex would search for all of the bad words in a single pass over each comment.

The other advantage of a regex-based approach is that you can check case insensitively, by supplying the appropriate flag when compiling the regex.

Upvotes: 1

Stephen C
Stephen C

Reputation: 718658

It is not particularly efficient. However, finding a more efficient solution is not straightforward.

Lets step back to a simpler problem.

private void findBadWords(List <String> wordList, List <String> auditList) {
    for(String badWord : auditList) {
        for (String word : wordList) {
            if (word.equals(badWord)) {
                System.err.println("Found a bad word");
            }
        }
    }
}

Suppose that wordList contains N words and auditList contains M words. Some simple analysis will show that the inner loop is executed N x M times. The N factor is unavoidable, but the M factor is disturbing. It means that the more "bad" words you have to check for the longer it takes to check.

There is a better way to do this:

private void findBadWords(List <String> wordList, HashSet<String> auditWords) {
    for (String word : wordList) {
        if (auditWords.contains(word))) {
            System.err.println("Found a bad word");
        }
    }
}

Why is that better? It is better (faster) because HashSet::contains doesn't need to check all of the audit words one at a time. In fact, in the optimal case it will check none of them (!) and the average case just one or two of them. (I won't go into why, but if you want to understand read the Wikipedia page on hash tables.)


But your problem is more complicated. You are using String::contains to test if each comment contains each bad word. That is not a simple string equality test (as per my simplified version).

What to do?

Well one potential solution is to split the the comments into an array of words (e.g. using String::split and then user the HashSet lookup approach. However:

  • That changes the behavior of your code. (In a good way actually: read up on the Scunthorpe problem!) You will now only match the audit words is they are actual words in the comment text.

  • Splitting a string into words is not cheap. If you use String::split it entails creating and using a Pattern object to find the word boundaries, creating substrings for each word and putting them into an array. You can probably do better, but it is always going to be a non-trivial calculation.

So the real question will be whether the optimization is going to pay off. That is ultimately going to depend on the value of M; i.e. the number of bad words you are looking for. The larger M is, the more likely it will be to split the comments into words and use a HashSet to test the words.

Another possible solution doesn't involve splitting the comments. You could take the list of audit words and assemble them into a single regex like this: \b(word-1|word-2|...|word-n)\b. Then use this regex with Matcher::find to search each comment string for bad words. The performance will depend on the optimizing capability of the regex engine in your Java platform. It has the potential to be faster than splitting.


My advice would be to benchmark and profile your entire application before you start. Only optimize:

  1. when the benchmarking says that the overall performance of the requests where this comment checking occurs is concerning. (If it is OK, don't waste your time optimizing.)

  2. when the profiling says that this method is a performance hotspot. (There is a good chance that the real hotspots are somewhere else. If so, you should optimize them rather than this method.)

Note there is an assumption that you have (sufficiently) completed your application and created a realistic benchmark for it before you think about optimizing. (Premature optimization is a bad idea ... unless you really know what you are doing.)

Upvotes: 5

Related Questions