Reputation: 881
I've been trying to create a PDF using PDFBox, and it fails with errors like:
java.lang.IllegalArgumentException: U+0083 is not available in this font's encoding: WinAnsiEncoding
I wanted to know if there is a way to capture such elements using a regular expression. I've tried:
(\\[a-z]00[0-9][0-9])
This seems to work fine if I validate it using a regex tool (like RegexBuddy), but it doesn't work in Java! I've tried using Java's Pattern and Matcher, with no success. What I've done with Pattern and Matcher is as follows:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class MatcherFindStartEndExample {
    public static void main(String[] args) {
        String text =
            "IMTECH Ã\u0083Â¢Ã\u0082Â\u0080Ã\u0093Â\u0083 " +
            "ERCC Ã\u0099Â¢à PLANNED ";
        String patternString = "(\\[a-z]00[0-9][0-9])";
        Pattern pattern = Pattern.compile(patternString);
        Matcher matcher = pattern.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
            System.out.println("found: " + count + " : "
                + matcher.start() + " - " + matcher.end());
        }
    }
}
The text I have right now looks something like:
IMTECH Ã\u0083Â¢Ã\u0082Â\u0080Ã\u0093Â\u0083 ERCC Ã\u0099Â¢à PLANNED
Thank you
Upvotes: 1
Views: 1221
Reputation: 44414
I’m not sure exactly what you’re trying to do here, but "\u0083" is a Java escape sequence representing one character, not six characters. It will not be matched by the regular expression "..0083", but it will be matched by ".".
Also, "\\" represents one character in a Java string, the backslash; in a regular expression, that single backslash escapes the following character. In your case, "\\[" matches a literal opening square bracket character.
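To make both points concrete, here is a minimal sketch. The class and method names (EscapeDemo and so on) are made up for illustration; only Pattern.matches and String.length are standard API.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // The escape for U+0083 in a Java string literal yields ONE char, not six.
    static int escapedLength() {
        return "\u0083".length();
    }

    // A single "." matches that one character.
    static boolean dotMatches() {
        return Pattern.matches(".", "\u0083");
    }

    // "..0083" needs six characters ending in the digits 0083, so it cannot match.
    static boolean sixCharPatternMatches() {
        return Pattern.matches("..0083", "\u0083");
    }

    // The Java literal "\\[" is the two-character regex for a literal '['.
    static boolean bracketMatches() {
        return Pattern.matches("\\[", "[");
    }

    public static void main(String[] args) {
        System.out.println(escapedLength());          // 1
        System.out.println(dotMatches());             // true
        System.out.println(sixCharPatternMatches());  // false
        System.out.println(bracketMatches());         // true
    }
}
```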
But all of that may be irrelevant, since the presence of sequences like ¢ suggests you are mishandling UTF-8 data. In particular, it looks like someone has forced every byte to a char value, instead of properly decoding the bytes using a Charset. And then the same mistake was applied a second time to the new, incorrect String.
Working backwards, these characters:
ERCC Ã\u0099Â¢
correspond to these hexadecimal byte values:
45 52 43 43 20 c3 99 c2 a2
If, instead of forcing each byte value to be a char, you decode them as UTF-8 bytes, you get the string "ERCC Ù¢". If we again treat each char as a byte value, we get:
45 52 43 43 20 d9 a2
and if we once again decode those bytes using UTF-8, we get "ERCC ٢", which I am guessing refers to ERCC2.
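The two rounds of working backwards can be reproduced in code. This is only a sketch: MojibakeDemo and undoOnce are hypothetical names, and it assumes the damage really was two rounds of bytes forced to chars (ISO-8859-1 maps U+0000 to U+00FF one-to-one onto bytes, which is why it is used to get the bytes back). The input string is built from the hexadecimal byte values listed above.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Undo one round of "every byte forced to a char":
    // turn each char back into a byte, then decode the bytes as UTF-8.
    static String undoOnce(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Chars corresponding to bytes 45 52 43 43 20 c3 99 c2 a2
        String garbled = "ERCC \u00c3\u0099\u00c2\u00a2";
        String once = undoOnce(garbled);  // "ERCC " + U+00D9 + U+00A2
        String twice = undoOnce(once);    // "ERCC " + U+0662
        System.out.println(once.equals("ERCC \u00d9\u00a2"));
        System.out.println(twice.equals("ERCC \u0662"));
    }
}
```

This repairs a string that is already damaged. The better fix is still to decode the bytes with the correct Charset at the point where the text is first read.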
If you fix the code which is extracting the text, you can match that string with something like Pattern.compile("ERCC\\s*\\d+", Pattern.UNICODE_CHARACTER_CLASS).
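As a quick check of why the flag matters, here is a small sketch (UnicodeDigitDemo and its sample text are made up for illustration). Without the flag, \d only matches ASCII 0-9; with Pattern.UNICODE_CHARACTER_CLASS it also matches other Unicode decimal digits such as U+0662.

```java
import java.util.regex.Pattern;

class UnicodeDigitDemo {
    // U+0662 is ARABIC-INDIC DIGIT TWO; the sample text is an assumption.
    static final String TEXT = "ERCC \u0662 PLANNED";

    // Returns whether the pattern is found in TEXT under the given flags.
    static boolean found(int flags) {
        return Pattern.compile("ERCC\\s*\\d+", flags).matcher(TEXT).find();
    }

    public static void main(String[] args) {
        System.out.println(found(0));                               // false
        System.out.println(found(Pattern.UNICODE_CHARACTER_CLASS)); // true
    }
}
```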
Upvotes: 2
Reputation: 1125
I'm assuming that you want to match the literal text "\u0083" (six characters), and not the Unicode character that it represents.
Try (\\\\u00[0-9][0-9]).
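A small sketch of when this pattern does and does not match (LiteralEscapeDemo is a hypothetical name). It only finds something if the text really contains a backslash followed by u and digits, for example text read from a file or written with a doubled backslash in Java source.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LiteralEscapeDemo {
    // Counts occurrences of: backslash, 'u', "00", then two ASCII digits.
    static int count(String text) {
        Matcher m = Pattern.compile("(\\\\u00[0-9][0-9])").matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Doubled backslash in source: the string holds six literal characters per escape.
        System.out.println(count("A\\u0083B\\u0099"));  // 2
        // Single backslash in source: the compiler already turned it into one char.
        System.out.println(count("A\u0083B"));          // 0
    }
}
```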
Upvotes: 0
Reputation: 4266
This regex will find every backslash + u + 4 digits:
String patternString = "[\\\\]+[u]+\\d{4}";
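Output, when the text contains the escapes as literal six-character sequences (for example, written as "\\u0083" in Java source) rather than as single characters: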
found: 1 : 8 - 14
found: 2 : 17 - 23
found: 3 : 24 - 30
found: 4 : 31 - 37
found: 5 : 38 - 44
found: 6 : 51 - 57
Upvotes: 0
Reputation: 300
Your patternString searches for literal text instead of a character. You can search for a list of characters like this:
String patternString = "[\u0083\u0082\u0080]";
(inside a character class, | is a literal character, so no separators are needed), or for a range like this:
String patternString = "[\u0080-\u0090]";
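For example, the range pattern could be used to strip such characters before handing the text to the PDF code. This is a sketch only: ControlCharDemo and strip are made-up names, and whether removing the characters is the right treatment depends on your data. Note that the sample text in the question also contains \u0093 and \u0099, which lie outside this range; the C1 control block runs from U+0080 to U+009F, so the upper bound may need widening.

```java
class ControlCharDemo {
    // Removes every char in the range U+0080..U+0090 (the range from the answer above).
    static String strip(String s) {
        return s.replaceAll("[\u0080-\u0090]", "");
    }

    public static void main(String[] args) {
        String cleaned = strip("IMTECH \u00c3\u0083 ERCC");
        // U+0083 is removed; U+00C3 is outside the range and stays.
        System.out.println(cleaned.equals("IMTECH \u00c3 ERCC"));
    }
}
```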
Upvotes: 0