Magnus Minor
Magnus Minor

Reputation: 11

What's the fastest way of checking if one string contains the characters of another?

I get String a and b and check if b contains the exact characters of a. For example: "ABBA" and "BAAAAA" returns false, "ABBA" and "ABABAB" returns true. I'm made a into and array with each String value and check if b contains that value, removing the value if it does so it doesn't find it twice.

However, the method is too slow, apparently on 12 seconds for some big strings. I've been trying but I haven't found a faster solution. Please help me out if you can!

public static boolean inneholdt(String a, String b)
{
    int k = 0;
    String[] Inn = a.split("(?!^)");

    for (int i = 0; i < Inn.length; i++)
    {
        if(b.contains(Inn[i]))
        {
            b = b.replaceFirst(Inn[i], "");

            k++;
        }
    }

    if(k >= Inn.length)
    {
        return true;
    } else return false;
}

Upvotes: 1

Views: 455

Answers (2)

Dici
Dici

Reputation: 25950

If I understood the problem, there are two main ways to do it :

  • sort the char arrays of both strings, and check the shortest array is a prefix of the longest one

  • populate a Map<Character, Integer>, which counts the number of occurrences of each character in the shortest string. Then, iterate through the longest string and decrement the count of each encountered character. If one counter reaches 0, remove it from the map. Return true if the map gets empty. If after consuming all the characters of the longest string the map did not get empty, return false.

The first one will take more time for long strings and is more likely to use a lot of memory because the second solution "compresses" the redundancy whereas the sorted array may contain long sequences of identical characters. However, if is very easy to write, read and understand the first one so if you don't need crazy performance it's ok.

I'll show you the code for the second solution :

// in Java 8. For older versions, it is also easy but more verbose
public static boolean inneholdt(String a, String b) {
    if (b.length() > a.length()) return false;

    Map<Character, Integer> countChars = new HashMap<>();
    for (char ch : b.toCharArray()) countChars.put(ch, countChars.getOrDefault(ch, 0) + 1);
    for (char ch : a.toCharArray()) {
        Integer count = countChars.get(ch);
        if (count != null) {
            if (count == 1) countChars.remove(ch);
            else            countChars.put(ch, count - 1);
        }
        if (countChars.isEmpty()) return true;
    }
    return false;
}

Note that this solution is optimized so that the execution time depends on the shortest string in the best case. If b is contained in a, we will most likely not iterate through all characters of a. This solution is then great if b is much shorter than a, because if is very likely in this case that b is contained in a.

In answer to Max's benchmark, I tried myself to compare the performance. Here is what I found :

My version :
Mean time of 3 ms with lenA = 50000, lenB = 50
Mean time of 1 ms with lenA = 50000, lenB = 500
Mean time of 1 ms with lenA = 50000, lenB = 5000
Mean time of 1 ms with lenA = 50000, lenB = 50000
Mean time of 10 ms with lenA = 5000000, lenB = 5000
Mean time of 18 ms with lenA = 5000000, lenB = 50000
Mean time of 93 ms with lenA = 5000000, lenB = 500000
Mean time of 519 ms with lenA = 5000000, lenB = 5000000
Mean time of 75 ms with lenA = 50000000, lenB = 50000
Mean time of 149 ms with lenA = 50000000, lenB = 500000
Mean time of 674 ms with lenA = 50000000, lenB = 5000000
Mean time of 9490 ms with lenA = 50000000, lenB = 50000000

Max's parallel solution :
Mean time of 89 ms with lenA = 50000, lenB = 50
Mean time of 22 ms with lenA = 50000, lenB = 500
Mean time of 23 ms with lenA = 50000, lenB = 5000
Mean time of 36 ms with lenA = 50000, lenB = 50000
Mean time of 2962 ms with lenA = 5000000, lenB = 5000
Mean time of 2021 ms with lenA = 5000000, lenB = 50000
Mean time of 2200 ms with lenA = 5000000, lenB = 500000
Mean time of 3988 ms with lenA = 5000000, lenB = 5000000
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.concurrent.ForkJoinTask.recordExceptionalCompletion(Unknown Source)
    at java.util.concurrent.CountedCompleter.internalPropagateException(Unknown Source)
    at java.util.concurrent.ForkJoinTask.setExceptionalCompletion(Unknown Source)
    at java.util.concurrent.ForkJoinTask.doExec(Unknown Source)
    at java.util.concurrent.ForkJoinPool$WorkQueue.pollAndExecCC(Unknown Source)
    at java.util.concurrent.ForkJoinPool.externalHelpComplete(Unknown Source)
    at java.util.concurrent.ForkJoinTask.externalAwaitDone(Unknown Source)
    at java.util.concurrent.ForkJoinTask.doInvoke(Unknown Source)
    at java.util.concurrent.ForkJoinTask.invoke(Unknown Source)
    at java.util.stream.ReduceOps$ReduceOp.evaluateParallel(Unknown Source)
    at java.util.stream.AbstractPipeline.evaluate(Unknown Source)
    at java.util.stream.ReferencePipeline.collect(Unknown Source)
    at Stack.inneholdt(Stack.java:34)
    at Stack.test(Stack.java:61)
    at Stack.main(Stack.java:12)

This shows that the implementation of Collectors.groupingBy is quite memory consuming. Max's solution isn't bad in principle even if it does more work than it could, be this is a high level solution so it has no control over some implementation details such as the way the records are grouped. Looking at Java standard library's code, it looks like a sort is performed so it requires some memory especially because several threads are sorting at the same time. I ran this with my default setup of -Xmx2g. I reran it using -Xmx4g.

My version :
Mean time of 3 ms with lenA = 50000, lenB = 50
Mean time of 1 ms with lenA = 50000, lenB = 500
Mean time of 2 ms with lenA = 50000, lenB = 5000
Mean time of 5 ms with lenA = 50000, lenB = 50000
Mean time of 7 ms with lenA = 5000000, lenB = 5000
Mean time of 17 ms with lenA = 5000000, lenB = 50000
Mean time of 93 ms with lenA = 5000000, lenB = 500000
Mean time of 642 ms with lenA = 5000000, lenB = 5000000
Mean time of 64 ms with lenA = 50000000, lenB = 50000
Mean time of 161 ms with lenA = 50000000, lenB = 500000
Mean time of 836 ms with lenA = 50000000, lenB = 5000000
Mean time of 11962 ms with lenA = 50000000, lenB = 50000000

Max's parallel solution :
Mean time of 45 ms with lenA = 50000, lenB = 50
Mean time of 18 ms with lenA = 50000, lenB = 500
Mean time of 19 ms with lenA = 50000, lenB = 5000
Mean time of 35 ms with lenA = 50000, lenB = 50000
Mean time of 1691 ms with lenA = 5000000, lenB = 5000
Mean time of 1162 ms with lenA = 5000000, lenB = 50000
Mean time of 1817 ms with lenA = 5000000, lenB = 500000
Mean time of 1671 ms with lenA = 5000000, lenB = 5000000
Mean time of 12052 ms with lenA = 50000000, lenB = 50000
Mean time of 10034 ms with lenA = 50000000, lenB = 500000
Mean time of 9467 ms with lenA = 50000000, lenB = 5000000
Mean time of 18122 ms with lenA = 50000000, lenB = 50000000

This time it runs fine, but still slowly. Note that the test cases for testing each version where different, this is a quite poorly done benchmark but I think it's enough to show Collectors.groupingBy is very memory consuming and that not trying to return as soon as possible is a great disadvantage.

The code is available here.

Upvotes: 2

Max Malysh
Max Malysh

Reputation: 31545

Java 8 + lambda expressions.

public static boolean inneholdt(String a, String b) {
    // Here we are counting occurrences of characters in the string
    Map<Integer, Long> aCounted = a.chars().parallel().boxed().collect(Collectors.groupingBy(o -> o, Collectors.counting()));
    Map<Integer, Long> bCounted = b.chars().parallel().boxed().collect(Collectors.groupingBy(o -> o, Collectors.counting()));

    // Now we're checking if the second string contains all the characters from the first
    return bCounted.keySet().parallelStream().allMatch(
            x -> bCounted.getOrDefault(x, 0l) >= aCounted.getOrDefault(x, 0l)
    );
}

How does it work?

  1. Count the number of occurrences of every character in both strings.
  2. Check if the second string has at least the same number of occurrences for every character.

This solution takes advantage of parallel execution and works with any literals, as it counts code points. However, @Dici's solution will be much faster for the case he describes.

Also, I've became interested in how parallel streams affect performance, so there are some numbers. N stands for the length of the strings.

Alphanumeric N=50000:    26ms   vs 10ms   vs 16ms
Alphanumeric N=5000000:  425ms  vs 460ms  vs 162ms
Alphanumeric N=50000000: 3812ms vs 3297ms vs 1933ms

First number is for @Dici's solution, second is for my solution with ordinary streams, and the third is for the version from this answer.

Upvotes: 2

Related Questions