Reputation: 17798
I am working with a large string that represents an html page and is processed then. What I do is following:
String data = <HTML PAGE CONTENT>;
// remove first/last appostrove
data = data.substring(1, data.length() - 1);
data = StringUtils.replace(data, "\\u003C", "<");
data = StringUtils.replace(data, "\\u003E", ">");
data = StringUtils.replace(data, "\\\"", "\"");
// the head html element is not needed, so I remove it beforehand
data = removeTag(data, "head", true);
// format the data if necessary in utf8
// => necessary, otherwise I see unwanted characters in my data
data = cleanString(data);
// continue... here I only parse out a list of all relevant tags I'm interested in
// from here on I use a html parser, which is memory efficient...
Problem
For some people I get OOM exceptions, mostly somewhere in between my string process function, so I'm looking to improve them. I appreciate any suggestion that improves my code in memory efficiency (speed is not important!).
Functions
private static String removeTag(String html, String tag, boolean replaceWithEmpty) {
String regex = "<" + tag + ">.*?</" + tag + ">";
return StringUtils.replaceAll(html, regex, replaceWithEmpty ? "<" + tag + "></" + tag + ">" : "");
}
private static String cleanString(String s) {
try {
// Convert from Unicode to UTF-8
byte[] utf8 = s.getBytes("UTF-8");
// Convert from UTF-8 to Unicode
s = new String(utf8, "UTF-8");
} catch (UnsupportedEncodingException e) {
L.e(e);
}
return s;
}
StringUtils
public class StringUtils {
// compile each pattern once only!
private static HashMap<String, Pattern> COMPILED_PATTERNS = new HashMap<>();
private static Pattern getPattern(String regex) {
if (COMPILED_PATTERNS.containsKey(regex)) {
return COMPILED_PATTERNS.get(regex);
}
Pattern p = Pattern.compile(regex);
COMPILED_PATTERNS.put(regex, p);
return p;
}
public static Matcher match(String regex, String data) {
Pattern p = getPattern(regex);
return p.matcher(data);
}
public static String replace(final String str, final CharSequence searchChars, CharSequence replaceChars) {
return str.replace(searchChars, replaceChars);
}
public static String replaceAll(final String str, final String regex, String replacement) {
Pattern p = getPattern(regex);
return p.matcher(str).replaceAll(replacement);
}
public static String findContentBetween(String content, String prefix, String postfix) {
return findContentBetween(content, prefix, postfix, false);
}
public static String findContentBetween(String content, String prefix, String postfix, boolean searchEndFirst) {
if (content == null || content.length() == 0) {
return null;
}
if (searchEndFirst) {
int index = content.indexOf(postfix);
if (index >= 0) {
int end = -1;
int start = -1;
String s;
while (index >= 0) {
s = content.substring(index, index + 1);
if (s.equals("?")) {
end = index;
} else if (s.equals("/")) {
start = index + 1;
}
if (end != -1 && start != -1) {
break;
}
index--;
}
if (end > start && end >= 0) {
return content.substring(start, end);
}
}
} else {
int end;
int start = content.indexOf(prefix);
if (start > 0) {
start += prefix.length();
end = content.indexOf(postfix, start + 1);
if (end > start) {
return content.substring(start, end);
}
}
}
return null;
}
}
Upvotes: 1
Views: 502
Reputation: 140319
This answer is addressing the issue when working with a general string. There are better solutions if you're working with HTML.
data = data.substring(1, data.length() - 1);
data = StringUtils.replace(data, "\\u003C", "<");
data = StringUtils.replace(data, "\\u003E", ">");
data = StringUtils.replace(data, "\\\"", "\"");
String is immutable, so each of these strings is necessarily creating a new String (or, it's not doing anything). So, if each of these lines is largely leaving the string unchanged, you're basically just making copies of that string.
Instead, accumulate the updated string in a StringBuilder
, doing all of the replacements in one go:
StringBuilder sb = new StringBuilder(data.length());
Map<String, String> replacements = Map.of("\\u003C", "<", "\\u003E", ">" /* etc */);
for (int i = 1; i < data.length() - 1; ++i) {
sb.append(data.charAt(i));
for (Map.Entry<String, String> entry : replacements.entrySet()) {
String search = entry.getKey();
// This is basically checking "endsWith".
int endIndex = sb.length() - search.length();
if (endIndex >= 0 && sb.indexOf(search, endIndex) == endIndex) {
sb.delete(endIndex, sb.length());
sb.append(entry.getValue());
}
}
}
data = sb.toString();
Note that this is memory efficient, like you've asked for; there are ways to make this more time efficient too.
For example, you could compile a Pattern
which matches the things you want to replace:
Pattern p = Pattern.compile(
replacements.keySet()
.stream()
.map(Pattern::quote)
.collect(Collectors.joining("|")));
and then use the Matcher
API, which is well-suited to this task:
Matcher m = p.matcher(data);
int prev = 1;
while (m.find()) {
sb.append(data, prev, m.start());
sb.append(replacements.get(m.group()));
prev = m.end();
}
sb.append(data, prev, data.length() - 1);
If you wanted to extend the Pattern
/Matcher
approach to cover the head
replacement too, you could append "|<head>[\s\S]*?</head>"
to the pattern, and then handle it specially in the loop:
if (!m.group().startsWith("<head>")) {
sb.append(replacements.get(m.group()));
}
But as you start to descend this path of trying to use regex with HTML, you will quickly find its shortcomings...
Upvotes: 4
Reputation: 22977
Regular expressions in combination with large strings is generally not a good idea. Stronger, you shouldn't parse [X]HTML with regex. Especially when the pattern uses capture groups, it must take care of much. In addition, a <div>
inside a <div>
will break the code.
You could of course grab a StringBuilder, which conserves a part of the memory, but the problem of parsing HTML with regular expressions still exists.
It is correct that if you apply replacements within large portions of text, many modified copies of the target text are possibly created. However, some of your requirements can be handled by the parser.
Removing tags
You should be able to do something like this:
Elements selector = docsoup.select("<your tag>");
for (Element element : selector) {
element.remove();
}
Upvotes: 1