Maxsteel

Reputation: 2040

Escape utf8 using regex

I ran a data processing job and unescaped the data by mistake. It turned every escaped byte sequence such as \x0a\xfa into x0axfa. I want to write a regex that puts the \ back before each x. I tried this:

`regex:((\([\x00-\x7F]\)|\w){2})+`
replace with: \\$1

However, it replaces everything before the last 2 characters with \. What's the correct way to solve this problem? (I have to do a regex replace; I cannot run the data processing again. It's huge.)

Input: blah blah x0ax0fx12...

Desired Output: blah blah \x0a\x0f\x12...

Upvotes: 0

Views: 267

Answers (2)

Bohemian

Reputation: 425238

Use a negative lookbehind so that already-escaped sequences are left alone, and a lookahead to find the insertion point for the backslash:

str = str.replaceAll("(?<!\\\\)(?=x[a-z0-9]{2,})", "\\\\");

The quadruple backslash is needed for a literal backslash in a Java regex; escaped once for the regex, then each one again for the string literal.
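To check the replacement above, a small standalone program can help. This is only a sketch: the class name `EscapeDemo` and the helper `reescape` are hypothetical. It also assumes every lost escape was exactly `x` plus two hex digits, so it narrows the character class to `[0-9a-fA-F]{2}`. Ordinary text that happens to contain `x` followed by two hex-like characters (for example `xab`) would still get a backslash.

```java
public class EscapeDemo {
    // Hypothetical helper: inserts a backslash before every "x" that is
    // followed by two hex digits and is not already preceded by a backslash.
    static String reescape(String s) {
        // Both the lookbehind and the lookahead are zero-width, so the
        // replacement only inserts a backslash and consumes nothing.
        return s.replaceAll("(?<!\\\\)(?=x[0-9a-fA-F]{2})", "\\\\");
    }

    public static void main(String[] args) {
        // Unescaped input from the question.
        System.out.println(reescape("blah blah x0ax0fx12"));
        // Input that is already escaped should come back unchanged.
        System.out.println(reescape("already \\x0a escaped"));
    }
}
```

The first line printed should be `blah blah \x0a\x0f\x12`, and the second input should pass through untouched because of the lookbehind.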

Upvotes: 1

terafl0ps

Reputation: 704

In a case like this, I would use an expression like (x[0-9A-Fa-f]{1,4})+ to identify the chunk of UTF-8 data without backslashes on each line.

From there, you can use Java's String.split("x") to make an array of strings representing the bytes without the "x". If regexMatch is a string containing the match from your expression, like "x0ax0fx12", then you could do something like this:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {
    public static void main(String[] args) {
        String inputText = "blah blah x0ax0fx12 blah blah";
        String regexMatch = "";
        // Match a run of "x" followed by 1-4 hex digits, e.g. "x0ax0fx12".
        Pattern pattern = Pattern.compile("(x[0-9A-Fa-f]{1,4})+");
        Matcher matcher = pattern.matcher(inputText);
        // Only the first run in the input is handled here.
        if (matcher.find()) {
            regexMatch = matcher.group(0);
        }
        // Rebuild the run, putting "\x" back in front of each byte.
        StringBuilder replacedOutput = new StringBuilder();
        for (String splitStr : regexMatch.split("x")) {
            // split("x") yields a leading empty string; skip it.
            if (!splitStr.isEmpty()) {
                replacedOutput.append("\\x").append(splitStr);
            }
        }
        System.out.println(replacedOutput);
    }
}

This should output "\x0a\x0f\x12" and you should be able to substitute it back into the line that matched in your file at the point where the matcher found it.
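The substitution step described above could be sketched as follows, handling every matched run in a line rather than only the first. The class name `ReescapeRuns` and the helper `reescapeLine` are hypothetical, and the sketch swaps the split-and-rejoin loop for `Matcher.appendReplacement` with `Matcher.quoteReplacement`. It keeps the same expression, so it shares its limitation: an `x` followed by a hex-like letter inside an ordinary word (such as the `xa` in "example") would also be escaped.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReescapeRuns {
    // Same expression as in the answer: one or more "x" + 1-4 hex digit groups.
    private static final Pattern RUN = Pattern.compile("(x[0-9A-Fa-f]{1,4})+");

    // Hypothetical helper: rewrites every matched run in the given line.
    static String reescapeLine(String line) {
        Matcher matcher = RUN.matcher(line);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            // Put a backslash before each "x" inside the matched run.
            String escaped = matcher.group().replace("x", "\\x");
            // quoteReplacement keeps the backslashes literal in appendReplacement.
            matcher.appendReplacement(out, Matcher.quoteReplacement(escaped));
        }
        // Copy whatever follows the last match.
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(reescapeLine("blah blah x0ax0fx12 blah blah x1b"));
    }
}
```

Lines without any match are returned unchanged, so the helper can be applied to every line of the file.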

Upvotes: 0
