prom85
prom85

Reputation: 17798

Memory efficient replace functions

I am working with a large string that represents an html page and is processed then. What I do is following:

String data = <HTML PAGE CONTENT>;

// remove first/last appostrove
data = data.substring(1, data.length() - 1);
data = StringUtils.replace(data, "\\u003C", "<");
data = StringUtils.replace(data, "\\u003E", ">");
data = StringUtils.replace(data, "\\\"", "\"");
// the head html element is not needed, so I remove it beforehand
data = removeTag(data, "head", true);
// format the data if necessary in utf8 
// => necessary, otherwise I see unwanted characters in my data
data = cleanString(data);

// continue... here I only parse out a list of all relevant tags I'm interested in
// from here on I use a html parser, which is memory efficient...

Problem

For some people I get OOM exceptions, mostly somewhere in between my string process function, so I'm looking to improve them. I appreciate any suggestion that improves my code in memory efficiency (speed is not important!).

Functions

private static String removeTag(String html, String tag, boolean replaceWithEmpty) {
    String regex = "<" + tag + ">.*?</" + tag + ">";
    return StringUtils.replaceAll(html, regex, replaceWithEmpty ? "<" + tag + "></" + tag + ">" : "");
}

private static String cleanString(String s) {
    try {
        // Convert from Unicode to UTF-8
        byte[] utf8 = s.getBytes("UTF-8");
        // Convert from UTF-8 to Unicode
        s = new String(utf8, "UTF-8");
    } catch (UnsupportedEncodingException e) {
        L.e(e);
    }

    return s;
}

StringUtils

public class StringUtils {

    // compile each pattern once only!
    private static HashMap<String, Pattern> COMPILED_PATTERNS = new HashMap<>();

    private static Pattern getPattern(String regex) {
        if (COMPILED_PATTERNS.containsKey(regex)) {
            return COMPILED_PATTERNS.get(regex);
        }
        Pattern p = Pattern.compile(regex);
        COMPILED_PATTERNS.put(regex, p);
        return p;
    }

    public static Matcher match(String regex, String data) {
        Pattern p = getPattern(regex);
        return p.matcher(data);
    }

    public static String replace(final String str, final CharSequence searchChars, CharSequence replaceChars) {
        return str.replace(searchChars, replaceChars);
    }

    public static String replaceAll(final String str, final String regex, String replacement) {
        Pattern p = getPattern(regex);
        return p.matcher(str).replaceAll(replacement);
    }

    public static String findContentBetween(String content, String prefix, String postfix) {
        return findContentBetween(content, prefix, postfix, false);
    }

    public static String findContentBetween(String content, String prefix, String postfix, boolean searchEndFirst) {
        if (content == null || content.length() == 0) {
            return null;
        }

        if (searchEndFirst) {
            int index = content.indexOf(postfix);
            if (index >= 0) {
                int end = -1;
                int start = -1;
                String s;
                while (index >= 0) {
                    s = content.substring(index, index + 1);
                    if (s.equals("?")) {
                        end = index;
                    } else if (s.equals("/")) {
                        start = index + 1;
                    }
                    if (end != -1 && start != -1) {
                        break;
                    }

                    index--;
                }
                if (end > start && end >= 0) {
                    return content.substring(start, end);
                }
            }
        } else {
            int end;
            int start = content.indexOf(prefix);
            if (start > 0) {
                start += prefix.length();
                end = content.indexOf(postfix, start + 1);
                if (end > start) {
                    return content.substring(start, end);
                }
            }
        }
        return null;
    }
}

Upvotes: 1

Views: 502

Answers (2)

Andy Turner
Andy Turner

Reputation: 140319

This answer is addressing the issue when working with a general string. There are better solutions if you're working with HTML.

data = data.substring(1, data.length() - 1);
data = StringUtils.replace(data, "\\u003C", "<");
data = StringUtils.replace(data, "\\u003E", ">");
data = StringUtils.replace(data, "\\\"", "\"");

String is immutable, so each of these strings is necessarily creating a new String (or, it's not doing anything). So, if each of these lines is largely leaving the string unchanged, you're basically just making copies of that string.

Instead, accumulate the updated string in a StringBuilder, doing all of the replacements in one go:

StringBuilder sb = new StringBuilder(data.length());
Map<String, String> replacements = Map.of("\\u003C", "<", "\\u003E", ">" /* etc */);
for (int i = 1; i < data.length() - 1; ++i) {
  sb.append(data.charAt(i));

  for (Map.Entry<String, String> entry : replacements.entrySet()) {
    String search = entry.getKey();

    // This is basically checking "endsWith".
    int endIndex = sb.length() - search.length();
    if (endIndex >= 0 && sb.indexOf(search, endIndex) == endIndex) {
      sb.delete(endIndex, sb.length());
      sb.append(entry.getValue());
    }
   }
}
data = sb.toString();

Note that this is memory efficient, like you've asked for; there are ways to make this more time efficient too.

For example, you could compile a Pattern which matches the things you want to replace:

Pattern p = Pattern.compile(
    replacements.keySet()
        .stream()
        .map(Pattern::quote)
        .collect(Collectors.joining("|")));

and then use the Matcher API, which is well-suited to this task:

Matcher m = p.matcher(data);
int prev = 1;
while (m.find()) {
  sb.append(data, prev, m.start());
  sb.append(replacements.get(m.group()));
  prev = m.end();
}
sb.append(data, prev, data.length() - 1);

Ideone demo

If you wanted to extend the Pattern/Matcher approach to cover the head replacement too, you could append "|<head>[\s\S]*?</head>" to the pattern, and then handle it specially in the loop:

if (!m.group().startsWith("<head>")) {
  sb.append(replacements.get(m.group()));
}

But as you start to descend this path of trying to use regex with HTML, you will quickly find its shortcomings...

Upvotes: 4

MC Emperor
MC Emperor

Reputation: 22977

Regular expressions in combination with large strings is generally not a good idea. Stronger, you shouldn't parse [X]HTML with regex. Especially when the pattern uses capture groups, it must take care of much. In addition, a <div> inside a <div> will break the code.

You could of course grab a StringBuilder, which conserves a part of the memory, but the problem of parsing HTML with regular expressions still exists.


Edit

It is correct that if you apply replacements within large portions of text, many modified copies of the target text are possibly created. However, some of your requirements can be handled by the parser.

  • Removing tags
    You should be able to do something like this:

    Elements selector = docsoup.select("<your tag>");
    for (Element element : selector) {
        element.remove();
    }
    

Upvotes: 1

Related Questions