Reputation: 2040
I ran a data processing job and unescaped the data by mistake. It turned every escaped UTF-8 byte sequence such as \x0a\xfa into x0axfa.
I want to write a regex that puts those \ back before each x. I tried this:
`regex:((\([\x00-\x7F]\)|\w){2})+`
replace with: \\$1
However, it replaces everything before the last 2 characters with \. What's the correct way to solve this problem? (I have to do a regex replace; I cannot run the data processing again. It's huge.)
Input: blah blah x0ax0fx12...
Desired Output: blah blah \x0a\x0f\x12...
Upvotes: 0
Views: 267
Reputation: 425238
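Use a negative lookbehind to avoid escaping sequences that are already escaped, and a lookahead to find the insertion point for the backslash:
str = str.replaceAll("(?<!\\\\)(?=x[0-9A-Fa-f]{2})", "\\\\");
The lookahead requires exactly two hex digits after the x, so ordinary words such as "xml" are left alone. The quadruple backslash is needed for a literal backslash in a Java regex: it is escaped once for the regex, then each one again for the string literal. The replacement string needs the same treatment, because replaceAll also treats \ as an escape character there.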
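As a quick check of the lookaround approach, here is a minimal sketch, assuming the escapes are always two hex digits as in the question's sample. The class name `RestoreBackslashes` and the helper `restore` are made up for illustration.

```java
public class RestoreBackslashes {
    // Hypothetical helper: inserts a backslash before every "x" that is
    // followed by two hex digits and not already preceded by a backslash.
    static String restore(String s) {
        return s.replaceAll("(?<!\\\\)(?=x[0-9A-Fa-f]{2})", "\\\\");
    }

    public static void main(String[] args) {
        // The sample input from the question.
        System.out.println(restore("blah blah x0ax0fx12"));
        // A sequence that is already escaped is left untouched.
        System.out.println(restore("blah \\x0ax0f"));
    }
}
```

Be aware that this cannot tell a stripped escape from ordinary text: a word like "boxed" contains an x followed by two hex digits and would also get a backslash, so it is worth spot-checking the result on real data.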
Upvotes: 1
Reputation: 704
In a case like this, I would use an expression like (x[0-9A-Fa-f]{1,4})+ to identify the chunk of UTF-8 data that has lost its backslashes on each line.
From there, you can use Java's String.split("x") to get an array of strings representing the bytes without the "x". If regexMatch is a string containing the match from your expression, like "x0ax0fx12", then you could do something like this:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {
    public static void main(String[] args) {
        String inputText = "blah blah x0ax0fx12 blah blah";
        String regexMatch = "";

        // Find the first run of "x" + hex digits that lost its backslashes.
        Pattern pattern = Pattern.compile("(x[0-9A-Fa-f]{1,4})+");
        Matcher matcher = pattern.matcher(inputText);
        if (matcher.find()) {
            regexMatch = matcher.group(0);
        }

        // Split on "x" and put "\x" back in front of each non-empty piece.
        StringBuilder replacedOutput = new StringBuilder();
        for (String splitStr : regexMatch.split("x")) {
            if (!splitStr.isEmpty()) {
                replacedOutput.append("\\x").append(splitStr);
            }
        }
        System.out.println(replacedOutput);
    }
}
This should output "\x0a\x0f\x12" and you should be able to substitute it back into the line that matched in your file at the point where the matcher found it.
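The substitution step could be sketched roughly as follows, using `Matcher.appendReplacement` and `Matcher.appendTail` so that every run in a line is rewritten in place. This is only an illustration under two assumptions: each escape is exactly two hex digits (as in the question's sample, narrower than the `{1,4}` above), and no backslashes survived in the data. The names `ReinsertEscapes` and `fixLine` are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReinsertEscapes {
    // Assumption: a run is one or more "x" + two hex digits, back to back.
    private static final Pattern RUN = Pattern.compile("(?:x[0-9A-Fa-f]{2})+");

    // Hypothetical helper: rewrites every run in the line, keeping the rest.
    static String fixLine(String line) {
        Matcher m = RUN.matcher(line);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Inside a run, every "x" is an escape prefix, since "x" is not a hex digit.
            String fixed = m.group().replace("x", "\\x");
            // quoteReplacement keeps the backslashes literal in the replacement.
            m.appendReplacement(out, Matcher.quoteReplacement(fixed));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(fixLine("blah blah x0ax0fx12 blah blah"));
    }
}
```

Because the pattern has no lookbehind, running it over data that still contains some intact \x sequences would double their backslashes; in that case the lookbehind from the other answer would need to be added.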
Upvotes: 0