Reputation: 14260
One of our module's performance relies highly on how we replace substrings in string.
We form a "replacement map" which can contain more than 3500 string pairs and then we apply it with StringUtils.replaceEach(text, searchList, replacementList)
to big strings (several MBs).
The keys and values are all unique and in most cases have the same character length (but it's not something we can rely on).
Is there a more sophisticated approach to my task than StringUtils.replaceEach()
? Something which may be an overkill for simple replacements solved by replaceEach()
, but which is much faster in my "heavy" case.
Upvotes: 7
Views: 4011
Reputation: 9365
You can use a regexp engine, to effectively match your keys against the input string, and replace them.
First, concatenate all your keys with an alternation operator, like that:
var keys = "keyA|keyB|keyC";
Next, compile a pattern:
Pattern pattern = Pattern.compile("(" + keys + ")")
Create a matcher against your input text:
Matcher matcher= pattern.matcher(text);
Now, apply your regexp in a loop, to find all the keys in your text , and use appendReplacement (which is an "inline" string replacement method), to replace all of them with the corresponding value:
StringBuffer sb = new StringBuffer();
while (matcher.find()) {
matcher.appendReplacement(sb,dictionary.get(matcher.group(0)));
}
matcher.appendTail(sb);
And here you go.
Note that this might look a bit naive at first, but indeed regexp engine is heavily optimized for the task at hand, and as Java regexp implementation also allows for an "inline" replacement, it all works very well.
I did a small benchmark, by applying a list of color names (~200 different color names), as defined in /usr/share/X11/rgb.txt against the "Crime and Punishment" by Fyodor Dostoyevsky, I downloaded from Project Gutenberg (~1MB in size), using the technique described, and it worked around
x12 times faster than StringUtils.replaceEach - 900ms vs 10700ms
for the latter (not counting for the Pattern compilation time).
P.S. if your keys can potentially contain characters, unsafe for regexp , like .^$(), you should use Pattern.quote() before adding them to your pattern.
Sidenote:
This method will replace keys, in the order, they appear in the pattern list, e.g. "a=>1|b=>2|aa=>3" when applied to "welcome to bazaar" will result in "welcome to b1z11r", and not "welcome to b1z3r", if you want the longest match, you should sort your keys lexicographically before adding them to the pattern (i.e. "b|aa|a"). It also applies to your original StringUtils.replaceEach() method.
Update:
The method above should work nice for the problem, as formulated in the original question, i.e. when the size of the replacement map is (relatively) small compared to the input data size.
If, instead you have a very long dictionary, applied to a short text, the linear search as done by StringUtils.replaceEach() can be faster than it.
I've made an additional benchmark illustrating that, by applying a dictionary of 10000 randomly chosen words (+4 characters long):
cat /usr/share/dict/words | grep -E "^.{4,}$" | shuf | head -10000
against the: 1024,2048,4096,8192,16384,32768,65536,131072,262144 and 524288 characters long excerpts from the very same "Crime and Punishment" text.
The results are given below:
text Ta(ms) Tb(ms) Ta/Tb(speed up)
---------------------------------------
1024 99 240 0.4125
2048 43 294 0.1462585
4096 113 721 0.1567267
8192 128 1329 0.0963130
16384 320 2230 0.1434977
32768 2052 3708 0.5533981
65536 6811 6650 1.0242106
131072 32422 12663 2.5603728
262144 150655 23011 6.5470862
524288 614634 29874 20.574211
Note the pattern string length is 135537 bytes (all 10000 keys concatenated)
Upvotes: 7
Reputation: 14260
While appendReplacement solution proposed by @zeppelin was surprisingly fast on "heaviest piece of data" it turned out to be a nightmare with bigger map.
The best solution so far turned out to be a composition of what we had (StringUtils.replaceEach) and what was proposed:
protected BackReplacer createBackReplacer(Map<ReplacementKey, String> replacementMap) {
if (replacementMap.isEmpty()) {
return new BackReplacer() {
@Override
public String backReplace(String str) {
return str;
}
};
}
if (replacementMap.size() > MAX_SIZE_FOR_REGEX) {
final String[] searchStrings = new String[replacementMap.size()];
final String[] replacementStrings = new String[replacementMap.size()];
int counter = 0;
for (Map.Entry<ReplacementKey, String> replacementEntry : replacementMap.entrySet()) {
searchStrings[counter] = replacementEntry.getValue();
replacementStrings[counter] = replacementEntry.getKey().getValue();
counter++;
}
return new BackReplacer() {
@Override
public String backReplace(String str) {
return StringUtils.replaceEach(str, searchStrings, replacementStrings);
}
};
}
final Map<String, String> replacements = new HashMap<>();
StringBuilder patternBuilder = new StringBuilder();
patternBuilder.append('(');
for (Map.Entry<ReplacementKey, String> entry : replacementMap.entrySet()) {
replacements.put(entry.getValue(), entry.getKey().getValue());
patternBuilder.append(entry.getValue()).append('|');
}
patternBuilder.setLength(patternBuilder.length() - 1);
patternBuilder.append(')');
final Pattern pattern = Pattern.compile(patternBuilder.toString());
return new BackReplacer() {
@Override
public String backReplace(String str) {
if (str.isEmpty()) {
return str;
}
StringBuffer sb = new StringBuffer(str.length());
Matcher matcher = pattern.matcher(str);
while (matcher.find()) {
matcher.appendReplacement(sb, replacements.get(matcher.group(0)));
}
matcher.appendTail(sb);
return sb.toString();
}
};
}
StringUtils algorithm (MAX_SIZE_FOR_REGEX=0):
type=TIMER, name=*.run, count=8127, min=4.239809, max=4235197.925261, mean=645.736554, stddev=47197.97968925558, duration_unit=milliseconds
appendReplace algorithm (MAX_SIZE_FOR_REGEX=1000000):
type=TIMER, name=*.run, count=8155, min=4.374516, max=7806145.439165999, mean=1145.757953, stddev=86668.38562815856, duration_unit=milliseconds
Mixed solution (MAX_SIZE_FOR_REGEX=5000):
type=TIMER, name=*.run, count=8155, min=3.5862789999999998, max=376242.25076799997, mean=389.68986564688714, stddev=11733.9997814448, duration_unit=milliseconds
Our data:
type=HISTOGRAM, name=initialValueLength, count=569549, min=0, max=6352327, mean=6268.940661478599, stddev=198123.040651236, median=12.0, p75=16.0, p95=32.0, p98=854.0, p99=1014.5600000000013, p999=6168541.008000023
type=HISTOGRAM, name=replacementMap.size, count=8155, min=0, max=65008, mean=73.46108949416342, stddev=2027.471388983965, median=4.0, p75=7.0, p95=27.549999999999955, p98=55.41999999999996, p99=210.10000000000036, p999=63138.68900000023
This change halved time spend in StringUtils.replaceEach in former solution and gave us 25% performance boost in our module which was mostly IO-bound.
Upvotes: 0
Reputation: 64925
The slow part of this algorithm is finding all the matches. The replacement is straightforward if done in a smart way (i.e., in a temporary char buffer, only shifting each character at most once).
So really your question simplifies to the "multi-string search", which is already a well-studied problem. You can find a good summary of the approaches in this question - but the one line summary is "grep does a good job".
Zeppelin already showed a reasonable loop for this - the appendReplacement
behavior makes sure you won't be shifting things around unnecessarily (which would degrade this to O(n)
).
Upvotes: 0
Reputation: 8227
An optimal method for dealing with this situation: Pre-compile the source strings into code. Scan each of your source strings for the replacement keys; break the string into a series of code pieces with a function to insert the key result into a stream. For example: The following source string:
The quick $brown $fox jumped over the $lazy dog.
becomes
public StringBuilder quickBrown(Map<String, String> dict) {
StringBuilder sb = new StringBuilder();
sb.append("The quick ");
sb.append(dict.getOrElse("$brown", "brown"));
sb.append(" ");
sb.append(dict.getOrElse("$fox", "fox"));
sb.append(" jumped over the ");
sb.append(dict.getOrElse("$lazy", "lazy");
sb.append(" dog.");
return sb;
}
Then you invoke the method corresponding to the particular string with the dictionary of mappings you want substituted.
Note that by "scan" and "translate", I mean use a program to generate the Java code and then dynamically load the compiled class files as you need them.
Upvotes: 0
Reputation: 425043
StringUtils
is using an O(n * m)
algorithm (for every word to be replaced, make the replacement in the input) . When m
(the number of words to be substituted) is small, this is effectively O(n)
(the size of the input).
However, with a "large" number of substitutions to be checked, you will likely be better off processing each word of input, which will complete in O(n)
time.
Map<String, String> subs = new HashMap<>(); // populated
String replaced = Arrays.stream(input.split("\\b")) // O(n)
.map(w -> subs.getOrDefault(w, w)) // O(1)
.collect(Collectors.joining("")); // O(n)
Splitting on word boundaries not only preserves whitespace (by not consuming input) but makes the code rather simple.
Upvotes: 0
Reputation: 22251
First of all - if you are talking about optimisation, post your profiling results. It is the only reliable source of information about what should be optimized (see the Third Rule of Optimization).
If you've determined that the string operations do take the most time, then there are two things to keep in mind.
First of all, Java Strings are immutable. Each time you call a replace method you create a new string which, in your case most likely, means a lot of memory allocating. Java's gotten better with it over the years, still, if you can skip it, then do it. I've checked, StringUtils.replaceEach
does use a buffer and should be relatively memory efficient. Also, especially with a custom search algorithm from the second note, you could implement a custom solution for replacing. The custom solution may consist of creating your own char buffer for efficient replacing, using StringBuilder
/StringBuffer
for replacing (you'd have to keep track of lengths of replaces, because calling .toString()
before each search on StringBuffer
will be as inefficient as replacing the strings manually).
Secondly, there's the search algorithm itself. I do not know which is used by Apache's StringUtils
, but Java's default implementation is not optimal. You could use a separate library for searching.
Upvotes: 0