Paco de la Vega
Paco de la Vega

Reputation: 131

What is the most effective algorithm to compare a large number of strings?

Let's take 2 string arraylists

List<String> namesListA = new ArrayList<>(/*50 000 strings*/);
List<String> namesListB = new ArrayList<>(/*400 000 strings*/);

removeAll method seems not working. After:

namesListA.removeAll(namesListB);

namesListA.size() is still 50000. Edit: Input data was incorrect, it actually works but takes a long lime.

I wrote the following brute-force code:

boolean match;
    for (String stringA: namesListA)
    {
        match = false;
        for (String stringB: namesListB)
        {
            if (stringA.equals(stringB))
            {
                match = true;
                break;
            }
        }
        if (!match)
        {
            finallist.add(stringA);
        }
    }

But it takes 8 hours to perform. it there any known effective algorithm for searching strings? Like to sort strings in alphabetical order and then search letter by letter or something like this.

Upvotes: 1

Views: 1083

Answers (7)

radoh
radoh

Reputation: 4825

You could put elements of list namesListB into a new Set (preferably HashSet). Then it is much more effective to call namesListA.removeAll(setFromListB);, since the implementation of ArrayList.removeAll calls Collection.contains() which is much more effective in a Set (HashSet) than in an ArrayList (HashSet.contains() has constant time performance, while ArrayList.contains() has linear performance).

Anyway, namesListA.removeAll(namesListB); should work, if namesListA doesn't change, then the 2 lists have no elements in common.

Estimation of time complexity (N = namesListA.length, M = namesListB.length):
Creating the HashSet from namesListB: O(M)
Calling namesListA.removeAll(setListB): O(N * 1) = O(N)
In total: O(M + N) (which could be written as O(M) since M>N, but I'm not sure)

Upvotes: 6

LastFreeNickname
LastFreeNickname

Reputation: 1455

If it's not important which element is duplicate but only if there is any you can let the Collections do the trick for you.

int sizeA = listA.size();
int sizeB = listB.size();

Set merger = new HashSet((sizeA+sizeB)*someLoadFactor);
merger.addAll(listA);
merger.addAll(listB);
// Sets do not contain duplicates!

if (merger.size() < sizeA + sizeB){
    return true;
}
return false;

This runs in O(i+j) so efficiently O(n)!

Upvotes: 0

LastFreeNickname
LastFreeNickname

Reputation: 1455

Here's a solution in O(n*logn). Should be faster than the approaches posted yet. Edit: If you don't need the exact element, my other approach is faster.

1.) Sort both lists

Use Collections.sort(...) for efficient sorting in O(n*logn).

2.) Compare with two iterators

Fetch two iterators over the two lists. Then:

while(leftIterator.hasNext() && rightIterator.hasNext(){
    int comparisonResult = leftElement.compare(rightElement);
    if (comparisonResult == -1){
        leftElement = leftIterator.next();
    }
    else if (comparisonResult == 1){
        rightElement = rightIterator.next();
    }
    else{
        // found it!
        return true;
    }
}

(Sorry if I mistyped, don't have an IDE at my hand)

=> Sorting is in O(ilogi + jlogj))

=> Comparison is in O(i+j)


Result performance is efficiently in class O(n*logn). This should work nicely.

Upvotes: 1

Srikanth Balaji
Srikanth Balaji

Reputation: 2718

Doing a list of removeAll is probably a better solution, Considering you have both the variables with 50k and 400k sized Lists

namesListA.removeAll(namesListB);

Upvotes: 0

Nicolas Filotto
Nicolas Filotto

Reputation: 45005

I would recommend to use an HashSet instead of a List to store the Strings of the biggest collection in order to know whether the collection contains or not a given String with a time complexity of O(1) instead of O(n), then use removeAll(Collection<?> c) to keep only the Strings that are not in the second collection as next:

List<String> namesListA = new ArrayList<>(/*50 000 strings*/);
Set<String> namesSetB = new HashSet<>(/*400 000 strings*/);
namesListA.removeAll(namesSetB);

Upvotes: 1

Codor
Codor

Reputation: 17605

One possibility would be to parallelize the removal. The lists namesListA and namesListB can be grouped by starting character; then the removal could be done group-wise in parallel and the resulting lists could be concatenated again.

Assuming some standard Latin alphabet, this would result in roughly 26 groups which could be processed in parallel. If 4 threads can be run in parallel, I would expect a significant speedup.

Upvotes: 1

Mincong Huang
Mincong Huang

Reputation: 5562

Create a set for the 400 000 names in namesListB. Then use this set to remove the undesired elements of namesListA.

List<String> namesListA = new ArrayList<>(/*50 000 strings*/);
List<String> namesListB = new ArrayList<>(/*400 000 strings*/);

Set<String> undesiredNames = new HashSet<>(namesListB);

for (String name : namesListA) {
    if (undesiredNames.contains(name)) {
        namesListA.remove(name);
    }
}

Upvotes: 2

Related Questions