joelproko
joelproko

Reputation: 89

Efficient and non-interfering way of replacing multiple substrings in a String

I'm trying to apply the same replacement instructions several thousand times to different input strings with as little overhead as possible. I need to consider two things for this:

  1. The search Strings aren't necessarily all the same length: one may be just "a", another might be "ch", yet another might be "sch"
  2. What was already replaced shall not be replaced again: If the replacement patterns are [a->e; e->a], "beat" should become "baet", not "baat" or "beet".

With that in mind, this is the code I came up with:

public class Replacements {
    private String[] search;
    private String[] replace;
    Replacements(String[] s, String[] r)
    {
        if (s.length!=r.length) throw new IllegalArgumentException();
        Map<String,String> map = new HashMap<String,String>();
        for (int i=0;i<s.length;i++)
        {
            map.put(s[i], r[i]);
        }
        List<String> sortedKeys = new ArrayList(map.keySet());
        Collections.sort(sortedKeys, new StringLengthComparator());
        this.search = sortedKeys.toArray(new String[0]);
        Stack<String> r2 = new Stack<>();
        sortedKeys.stream().forEach((i) -> {
            r2.push(map.get(i));
        });
        this.replace = r2.toArray(new String[0]);
    }
    public String replace(String input)
    {
        return replace(input,0);
    }
    private String replace(String input,int i)
    {
        String out = "";
        List<String> parts = Arrays.asList(input.split(this.search[i],-1));
        for (Iterator it = parts.iterator(); it.hasNext();)
        {
            String part = it.next().toString();
            if (part.length()>0 && i<this.search.length-1) out += replace(part,i+1);
            if (it.hasNext()) out += this.replace[i];
        }
        return out;
    }
}

And then

String[] words;
//fill variable words
String[] s_input = "ou|u|c|ch|ce|ci".split("\\|",-1);
String[] r_input = "u|a|k|c|se|si".split("\\|",-1);
Replacements reps = new Replacements(s_input,r_input);
for (String word : words) {
    System.out.println(reps.replace(word));
}

(s_input and r_input would be up to the user, so they're just examples, just like the program wouldn't actually use println())

This code makes sure longer search strings get looked for first and also covers the second condition above.

It is, however, quite costly. What would be the most efficient way to accomplish what I'm doing here (especially if the number of Strings in words is significantly large)?

With my current code, "couch" should be converted into "kuc" (except it doesn't, apparently; it now does, thanks to the -1 in split(p,-1))

Upvotes: 1

Views: 132

Answers (3)

Robin479
Robin479

Reputation: 1686

public class Replacements
{
    private String[] search; // sorted in descending length and order, eg: sch, ch, c
    private String[] replace; // corresponding replacement

    Replacements(String[] s, String[] r)
    {
        if (s.length != r.length)
            throw new IllegalArgumentException();

        final TreeMap<String, String> map = new TreeMap<String, String>(Collections.reverseOrder());

        for (int i = 0; i < s.length; i++)
            map.put(s[i], r[i]);

        this.search = map.keySet().toArray(new String[map.size()]);
        this.replace = map.values().toArray(new String[map.size()]);
    }

    public String replace(String input)
    {
        final StringBuilder result = new StringBuilder();

        // start of yet-to-be-copied substring
        int s = 0;

        SEARCH:
        for (int i = s; i < input.length(); i++)
        {
            for (int p = 0; p < this.search.length; p++)
            {
                if (input.regionMatches(i, this.search[p], 0, this.search[p].length()))
                {
                    // append buffer and replacement
                    result.append(input, s, i).append(this.replace[p]);

                    // skip beyond current match and reset buffer
                    i += this.search[p].length();
                    s = i--;

                    continue SEARCH;
                }
            }
        }

        if (s == 0) // no matches? no changes!
            return input;

        // append remaining buffer
        return result.append(input, s, input.length()).toString();
    }
}

Upvotes: 0

Joop Eggen
Joop Eggen

Reputation: 109547

One could make a regex pattern from the keys and leave it to that module for optimization.

Obviously

"(ou|u|ch|ce|ci|c)"

needs to take care of ce/ci/c, either by reverse sorting or immediately as tree:

"(c(e|h|i)?|ou|u)"

Then

String soughtKeys = "ou|u|ch|ce|ci|c"; // c last
String replacements = "u|a|c|se|si|k";
Map<String, String> map = new HashMap<>();
... fill map

Pattern pattern = Pattern.compile("(" + soughtKeys + ")");

for (String word : words) {
    StringBuffer sb = new StringBuffer();
    Matcher m = pattern.matcher(word);
    while (m.find()) {
        m.appendReplacement(sb, map.get(m.group());
    }
    m.appendTail(sb);
    System.out.printf("%s -> %s%n", word, sb.toString());
}

The advantage being that regex is quite smart (though slow), and replacements are not done over replaced text.

Upvotes: 0

Jim Garrison
Jim Garrison

Reputation: 86764

This is not a full solution but it shows how to scan the input and find all target substrings in one pass. You would use a StringBuilder to assemble the result, looking up the replacements in a Map as you are currently doing. Use the start and end indexes to handle copying of non-matching segments.

public static void main(String[] args) throws Exception
{
    Pattern p = Pattern.compile("(ou|ch|ce|ci|u|c)");
    Matcher m = p.matcher("auouuchcceaecxici");
    while (m.find())
    {
        MatchResult r = m.toMatchResult();
        System.out.printf("s=%d e=%d '%s'\n", r.start(), r.end(), r.group());
    }
}

Output:

s=1 e=2 'u'
s=2 e=4 'ou'
s=4 e=5 'u'
s=5 e=7 'ch'
s=7 e=8 'c'
s=8 e=10 'ce'
s=12 e=13 'c'
s=15 e=17 'ci'

Note the strings in the regex have to be sorted in order of descending length to work correctly.

Upvotes: 1

Related Questions