Reputation: 881

Regular Expression to capture U+XXXX (or \uXXXX) characters in Java

I've been trying to create a PDF using PDFBox, and it fails with errors like:

java.lang.IllegalArgumentException: U+0083 is not available in this font's encoding: WinAnsiEncoding

I wanted to know whether there is a way to capture such characters using a regular expression. I've tried:

(\\[a-z]00[0-9][0-9])

This seems to work fine when I validate it in a regex tool (like RegexBuddy), but it doesn't work in Java. I've tried using Java's Pattern and Matcher, but with no success. What I've done with Pattern and Matcher is as follows:

import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class MatcherFindStartEndExample {

public static void main(String[] args) {

    String text    =
            "IMTECH Ã\u0083Â¢Ã\u0082Â\u0080Ã\u0093Â\u0083 " + 
            "ERCC Ã\u0099Â¢Ã PLANNED ";

    String patternString = "(\\[a-z]00[0-9][0-9])";

    Pattern pattern = Pattern.compile(patternString);
    Matcher matcher = pattern.matcher(text);

    int count = 0;
    while(matcher.find()) {
        count++;
        System.out.println("found: " + count + " : "
                + matcher.start() + " - " + matcher.end());
    }
  }
}

The text I have right now looks something like this:

IMTECH Ã\u0083Â¢Ã\u0082Â\u0080Ã\u0093Â\u0083 ERCC Ã\u0099Â¢Ã PLANNED

Thank you

Upvotes: 1

Answers (4)

VGR

Reputation: 44414

I’m not sure exactly what you’re trying to do here, but "\u0083" is a Java escape sequence representing one character, not six characters. It will not be matched by the regular expression "..0083", but it will be matched by ".".

Also, "\\" represents one character in a Java string, the backslash; in a regular expression, that single backslash escapes the following character. In your case "\\[" matches a literal opening square bracket character.
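A minimal sketch of both points (the class name EscapeDemo is only for illustration): the Unicode escape compiles to a single char, and the pattern from the question ends up matching the literal text [a-z]00 followed by two digits.

```java
import java.util.regex.Pattern;

public class EscapeDemo {
    // The pattern from the question, exactly as written in the Java source.
    static final Pattern QUESTION_PATTERN = Pattern.compile("(\\[a-z]00[0-9][0-9])");

    public static void main(String[] args) {
        // The compiler turns the Unicode escape into one char before the regex engine sees it.
        System.out.println("\u0083".length());                    // 1
        System.out.println(Pattern.matches(".", "\u0083"));       // true
        System.out.println(Pattern.matches("..0083", "\u0083"));  // false

        // The escaped bracket is a literal '[', so the rest is literal text too.
        System.out.println(QUESTION_PATTERN.matcher("[a-z]0083").matches()); // true
        System.out.println(QUESTION_PATTERN.matcher("\u0083").find());       // false
    }
}
```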

But all of that may be irrelevant, since the presence of sequences like Â¢ suggests you are mishandling UTF-8 data. In particular, it looks like someone has forced every byte to a char value, instead of properly decoding the bytes using a Charset. And then the same mistake was applied a second time to the new, incorrect String.

Working backwards, these characters:

ERCC Ã\u0099Â¢

correspond to these hexadecimal byte values:

45 52 43 43 20 c3 99 c2 a2

If, instead of forcing each byte value to be a char, you decode them as UTF-8 bytes, you get the string "ERCC Ù¢". If we again treat each char as a byte value, we get:

45 52 43 43 20 d9 a2

and if we once again decode those bytes using UTF-8, we get "ERCC ٢", which I am guessing refers to ERCC2.
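The two decoding steps above can be reproduced in a few lines. This is only a sketch, assuming the damage really was two rounds of forcing bytes to chars; the class and method names are hypothetical, and the proper fix is still in the code that extracts the text. ISO-8859-1 is used for the reverse step because it maps chars 0 to 255 one-to-one onto byte values.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeRepair {
    // Undo one round of "every byte forced to a char": turn each char back
    // into its byte value, then decode those bytes as UTF-8.
    static String undoOnce(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "ERCC " followed by the chars with values c3, 99, c2, a2.
        String garbled = "ERCC \u00c3\u0099\u00c2\u00a2";

        String once = undoOnce(garbled);   // "ERCC " + chars d9, a2
        String twice = undoOnce(once);     // "ERCC " + ARABIC-INDIC DIGIT TWO

        System.out.println(once.equals("ERCC \u00d9\u00a2")); // true
        System.out.println(twice.equals("ERCC \u0662"));      // true
    }
}
```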

If you fix the code which is extracting the text, you can match that string with something like Pattern.compile("ERCC\\s*\\d+", Pattern.UNICODE_CHARACTER_CLASS).
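A short sketch of why the flag matters here, assuming the correctly decoded text contains the Arabic-Indic digit as guessed above (the class name is hypothetical). Without UNICODE_CHARACTER_CLASS, \d only matches the ASCII digits 0 to 9.

```java
import java.util.regex.Pattern;

public class UnicodeDigitDemo {
    // Compile the pattern from the answer with the given flags and test the whole string.
    static boolean matches(String text, int flags) {
        return Pattern.compile("ERCC\\s*\\d+", flags).matcher(text).matches();
    }

    public static void main(String[] args) {
        String decoded = "ERCC \u0662"; // "ERCC " + ARABIC-INDIC DIGIT TWO

        System.out.println(matches(decoded, Pattern.UNICODE_CHARACTER_CLASS)); // true
        System.out.println(matches(decoded, 0));                               // false
        System.out.println(matches("ERCC2", 0));                               // true
    }
}
```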

Upvotes: 2

phlaxyr

Reputation: 1125

I'm assuming that you want to match the literal text "\u0083", and not the Unicode character that it represents.
Try (\\\\u00[0-9][0-9]).
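A sketch of when this pattern applies, with made-up sample strings and a hypothetical class name: it matches only if the text really contains a backslash, a u and digits as six separate characters (for example, text read from a file), not when the Java compiler has already turned the escape into one character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LiteralEscapeDemo {
    // Count how many times the regex is found in the text.
    static int countMatches(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String regex = "(\\\\u00[0-9][0-9])";

        // Doubled backslashes: the string holds the six-character text, not a control char.
        String literalText = "IMTECH \\u0083 ERCC \\u0099";
        // Single backslashes: the compiler replaces each escape with one char.
        String realChars = "IMTECH \u0083 ERCC \u0099";

        System.out.println(countMatches(regex, literalText)); // 2
        System.out.println(countMatches(regex, realChars));   // 0
    }
}
```

Note that [0-9] skips hex digits a to f; [0-9a-fA-F] would cover them if the literal text can contain values such as 008f.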

Upvotes: 0

achAmháin

Reputation: 4266

This regex finds every occurrence of a backslash followed by u and four digits:

String patternString = "[\\\\]+[u]+\\d{4}";

Output:

found: 1 : 8 - 14
found: 2 : 17 - 23
found: 3 : 24 - 30
found: 4 : 31 - 37
found: 5 : 38 - 44
found: 6 : 51 - 57

Upvotes: 0

ospf

Reputation: 300

Your patternString searches for literal text instead of a character. You can search for a list of characters like this (with no | between them, since inside a character class | is matched as a literal pipe):

String patternString = "[\u0083\u0082\u0080]";

or for a range like this:

String patternString = "[\u0080-\u0090]";
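A sketch of the range approach applied to the text from the question, assuming the goal is to locate (or strip) every C1 control character, U+0080 through U+009F. That range is wider than the one above, because the sample text also contains U+0093 and U+0099. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ControlCharFinder {
    // C1 control characters; the doubled backslashes let the regex engine parse the escapes.
    static final Pattern C1_CONTROLS = Pattern.compile("[\\u0080-\\u009f]");

    // Index of every C1 control character in the text.
    static List<Integer> positions(String text) {
        List<Integer> found = new ArrayList<>();
        Matcher matcher = C1_CONTROLS.matcher(text);
        while (matcher.find()) {
            found.add(matcher.start());
        }
        return found;
    }

    // Copy of the text with every C1 control character removed.
    static String strip(String text) {
        return C1_CONTROLS.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // The question's text, with the non-ASCII letters written as escapes.
        String text = "IMTECH \u00c3\u0083\u00c2\u00a2\u00c3\u0082\u00c2\u0080\u00c3\u0093\u00c2\u0083 "
                + "ERCC \u00c3\u0099\u00c2\u00a2\u00c3 PLANNED";

        System.out.println(positions(text));      // [8, 12, 14, 16, 18, 26]
        System.out.println(strip(text).length()); // 32 (the original has 38 chars)
    }
}
```

Stripping these characters would only hide the PDFBox error; the remaining text is still mis-decoded, so this is a diagnostic aid rather than a fix.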

Upvotes: 0
