Reputation: 14260

Replacing lots of substrings in big strings

One of our module's performance relies highly on how we replace substrings in string.

We form a "replacement map" which can contain more than 3500 string pairs and then we apply it with StringUtils.replaceEach(text, searchList, replacementList) to big strings (several MBs).

The keys and values are all unique and in most cases have the same character length (but it's not something we can rely on).

Is there a more sophisticated approach to my task than StringUtils.replaceEach()? Something which may be an overkill for simple replacements solved by replaceEach(), but which is much faster in my "heavy" case.

Upvotes: 7

Answers (6)

zeppelin

Reputation: 9365

You can use a regexp engine, to effectively match your keys against the input string, and replace them.

First, concatenate all your keys with an alternation operator, like that:

var keys = "keyA|keyB|keyC";

Next, compile a pattern:

Pattern pattern = Pattern.compile("(" + keys + ")")

Create a matcher against your input text:

Matcher matcher= pattern.matcher(text);

Now, apply your regexp in a loop, to find all the keys in your text , and use appendReplacement (which is an "inline" string replacement method), to replace all of them with the corresponding value:

StringBuffer sb = new StringBuffer();
while (matcher.find()) {
    matcher.appendReplacement(sb,dictionary.get(matcher.group(0)));
}
matcher.appendTail(sb);

And here you go.

Note that this might look a bit naive at first, but indeed regexp engine is heavily optimized for the task at hand, and as Java regexp implementation also allows for an "inline" replacement, it all works very well.

I did a small benchmark, by applying a list of color names (~200 different color names), as defined in /usr/share/X11/rgb.txt against the "Crime and Punishment" by Fyodor Dostoyevsky, I downloaded from Project Gutenberg (~1MB in size), using the technique described, and it worked around

x12 times faster than StringUtils.replaceEach - 900ms vs 10700ms

for the latter (not counting for the Pattern compilation time).

P.S. if your keys can potentially contain characters, unsafe for regexp , like .^$(), you should use Pattern.quote() before adding them to your pattern.

Sidenote:

This method will replace keys, in the order, they appear in the pattern list, e.g. "a=>1|b=>2|aa=>3" when applied to "welcome to bazaar" will result in "welcome to b1z11r", and not "welcome to b1z3r", if you want the longest match, you should sort your keys lexicographically before adding them to the pattern (i.e. "b|aa|a"). It also applies to your original StringUtils.replaceEach() method.

Update:

The method above should work nice for the problem, as formulated in the original question, i.e. when the size of the replacement map is (relatively) small compared to the input data size.

If, instead you have a very long dictionary, applied to a short text, the linear search as done by StringUtils.replaceEach() can be faster than it.

I've made an additional benchmark illustrating that, by applying a dictionary of 10000 randomly chosen words (+4 characters long):

cat /usr/share/dict/words | grep -E "^.{4,}$" | shuf | head -10000

against the: 1024,2048,4096,8192,16384,32768,65536,131072,262144 and 524288 characters long excerpts from the very same "Crime and Punishment" text.

The results are given below:

text    Ta(ms)  Tb(ms)  Ta/Tb(speed up)
---------------------------------------
1024    99      240     0.4125
2048    43      294     0.1462585
4096    113     721     0.1567267
8192    128     1329    0.0963130
16384   320     2230    0.1434977
32768   2052    3708    0.5533981
65536   6811    6650    1.0242106
131072  32422   12663   2.5603728
262144  150655  23011   6.5470862
524288  614634  29874   20.574211

Ta - StringUtils.replaceEach() time
Tb - matcher.appendReplacement() time

Note the pattern string length is 135537 bytes (all 10000 keys concatenated)

Upvotes: 7

okutane

Reputation: 14260

While appendReplacement solution proposed by @zeppelin was surprisingly fast on "heaviest piece of data" it turned out to be a nightmare with bigger map.

The best solution so far turned out to be a composition of what we had (StringUtils.replaceEach) and what was proposed:

protected BackReplacer createBackReplacer(Map<ReplacementKey, String> replacementMap) {
        if (replacementMap.isEmpty()) {
            return new BackReplacer() {
                @Override
                public String backReplace(String str) {
                    return str;
                }
            };
        }

        if (replacementMap.size() > MAX_SIZE_FOR_REGEX) {
            final String[] searchStrings = new String[replacementMap.size()];
            final String[] replacementStrings = new String[replacementMap.size()];

            int counter = 0;
            for (Map.Entry<ReplacementKey, String> replacementEntry : replacementMap.entrySet()) {
                searchStrings[counter] = replacementEntry.getValue();
                replacementStrings[counter] = replacementEntry.getKey().getValue();
                counter++;
            }

            return new BackReplacer() {
                @Override
                public String backReplace(String str) {
                    return StringUtils.replaceEach(str, searchStrings, replacementStrings);
                }
            };
        }

        final Map<String, String> replacements = new HashMap<>();
        StringBuilder patternBuilder = new StringBuilder();

        patternBuilder.append('(');
        for (Map.Entry<ReplacementKey, String> entry : replacementMap.entrySet()) {
            replacements.put(entry.getValue(), entry.getKey().getValue());
            patternBuilder.append(entry.getValue()).append('|');
        }

        patternBuilder.setLength(patternBuilder.length() - 1);
        patternBuilder.append(')');

        final Pattern pattern = Pattern.compile(patternBuilder.toString());

        return new BackReplacer() {
            @Override
            public String backReplace(String str) {
                if (str.isEmpty()) {
                    return str;
                }

                StringBuffer sb = new StringBuffer(str.length());

                Matcher matcher = pattern.matcher(str);
                while (matcher.find()) {
                    matcher.appendReplacement(sb, replacements.get(matcher.group(0)));
                }
                matcher.appendTail(sb);

                return sb.toString();
            }
        };
    }

StringUtils algorithm (MAX_SIZE_FOR_REGEX=0):

type=TIMER, name=*.run, count=8127, min=4.239809, max=4235197.925261, mean=645.736554, stddev=47197.97968925558, duration_unit=milliseconds

appendReplace algorithm (MAX_SIZE_FOR_REGEX=1000000):

type=TIMER, name=*.run, count=8155, min=4.374516, max=7806145.439165999, mean=1145.757953, stddev=86668.38562815856, duration_unit=milliseconds

Mixed solution (MAX_SIZE_FOR_REGEX=5000):

type=TIMER, name=*.run, count=8155, min=3.5862789999999998, max=376242.25076799997, mean=389.68986564688714, stddev=11733.9997814448, duration_unit=milliseconds

Our data:

type=HISTOGRAM, name=initialValueLength, count=569549, min=0, max=6352327, mean=6268.940661478599, stddev=198123.040651236, median=12.0, p75=16.0, p95=32.0, p98=854.0, p99=1014.5600000000013, p999=6168541.008000023
type=HISTOGRAM, name=replacementMap.size, count=8155, min=0, max=65008, mean=73.46108949416342, stddev=2027.471388983965, median=4.0, p75=7.0, p95=27.549999999999955, p98=55.41999999999996, p99=210.10000000000036, p999=63138.68900000023

This change halved time spend in StringUtils.replaceEach in former solution and gave us 25% performance boost in our module which was mostly IO-bound.

Upvotes: 0

BeeOnRope

Reputation: 64925

The slow part of this algorithm is finding all the matches. The replacement is straightforward if done in a smart way (i.e., in a temporary char buffer, only shifting each character at most once).

So really your question simplifies to the "multi-string search", which is already a well-studied problem. You can find a good summary of the approaches in this question - but the one line summary is "grep does a good job".

Zeppelin already showed a reasonable loop for this - the appendReplacement behavior makes sure you won't be shifting things around unnecessarily (which would degrade this to O(n)).

Upvotes: 0

Bob Dalgleish

Reputation: 8227

An optimal method for dealing with this situation: Pre-compile the source strings into code. Scan each of your source strings for the replacement keys; break the string into a series of code pieces with a function to insert the key result into a stream. For example: The following source string:

The quick $brown $fox jumped over the $lazy dog.

becomes

public StringBuilder quickBrown(Map<String, String> dict) {
  StringBuilder sb = new StringBuilder();
  sb.append("The quick ");
  sb.append(dict.getOrElse("$brown", "brown"));
  sb.append(" ");
  sb.append(dict.getOrElse("$fox", "fox"));
  sb.append(" jumped over the ");
  sb.append(dict.getOrElse("$lazy", "lazy");
  sb.append(" dog.");
  return sb;
}

Then you invoke the method corresponding to the particular string with the dictionary of mappings you want substituted.

Note that by "scan" and "translate", I mean use a program to generate the Java code and then dynamically load the compiled class files as you need them.

Upvotes: 0

Bohemian

Reputation: 425043

StringUtils is using an O(n * m) algorithm (for every word to be replaced, make the replacement in the input) . When m (the number of words to be substituted) is small, this is effectively O(n) (the size of the input).

However, with a "large" number of substitutions to be checked, you will likely be better off processing each word of input, which will complete in O(n) time.

Map<String, String> subs = new HashMap<>(); // populated
String replaced = Arrays.stream(input.split("\\b")) // O(n)
    .map(w -> subs.getOrDefault(w, w)) // O(1)
    .collect(Collectors.joining("")); // O(n)

Splitting on word boundaries not only preserves whitespace (by not consuming input) but makes the code rather simple.

Upvotes: 0

Dariusz

Reputation: 22251

First of all - if you are talking about optimisation, post your profiling results. It is the only reliable source of information about what should be optimized (see the Third Rule of Optimization).

If you've determined that the string operations do take the most time, then there are two things to keep in mind.

First of all, Java Strings are immutable. Each time you call a replace method you create a new string which, in your case most likely, means a lot of memory allocating. Java's gotten better with it over the years, still, if you can skip it, then do it. I've checked, StringUtils.replaceEach does use a buffer and should be relatively memory efficient. Also, especially with a custom search algorithm from the second note, you could implement a custom solution for replacing. The custom solution may consist of creating your own char buffer for efficient replacing, using StringBuilder/StringBuffer for replacing (you'd have to keep track of lengths of replaces, because calling .toString() before each search on StringBuffer will be as inefficient as replacing the strings manually).

Secondly, there's the search algorithm itself. I do not know which is used by Apache's StringUtils, but Java's default implementation is not optimal. You could use a separate library for searching.

Upvotes: 0

Replacing lots of substrings in big strings

Answers (6)

Related Questions