Kevin

Reputation: 3431

Java won't match .*

I have the following line in a file

00241386002|5296060|0|1|ClaimNote|29DEC2005:10:20:13.557194|JAR007|

I'm trying to match with

line.matches("^\\d+\\|\\d+\\|\\d+\\|\\d+.+$")

That pattern works on the previous ~10k or so lines in the file. It also works on the immediately preceding line which is the same up through the timestamp. It does not, however, work on that line. Even

line.matches(".*")

returns false.

Any help would be appreciated.

Upvotes: 5

Views: 307

Answers (1)

Pshemo

Reputation: 124275

Problem

The \d+\|\d+\|\d+\|\d+ part of your regex seems to work fine, which suggests that the problem is in the part handled by the dot (.+ in your pattern, .* in your second test).

Let's test which characters . cannot match by default, since any of them would prevent matches from returning true.
(I only test characters in the range 0-FFFF. Unicode has more characters than that, namely the supplementary ones that Java represents as surrogate pairs, so I am not claiming these are the only characters . can't match. Even if they are today, we can't be sure about the future.)

// Print every char in the range 0-FFFF (inclusive) that "." refuses to match.
for (int ch = 0; ch <= 0xFFFF; ch++) {
    if (!Character.toString((char) ch).matches(".*")) {
        System.out.format("%-4d hex: \\u%04x %n", ch, ch);
    }
}

This prints the following (comments added):

10 hex: \u000a - line feed (\n)
13 hex: \u000d - carriage return (\r)
133 hex: \u0085 - next line (NEL)
8232 hex: \u2028 - line separator
8233 hex: \u2029 - paragraph separator

So I suspect that your string contains one of these characters. Not all tools recognize every one of them as a line separator the way the regex engine does. For instance, let's test BufferedReader:

String data = "AAA\nBBB\rCCC\u0085DDD\u2028EEE\u2029FFF";

BufferedReader br = new BufferedReader(new StringReader(data));
String line = null;
while((line = br.readLine())!=null){
    System.out.println(line);
}

The result is:

AAA
BBB
CCCDDD
    EEE
    FFF
   ⬑ here we have `\u0085` (NEL) 

As you can see, tools that are not based on the regex engine can return a string that they consider a single line, but that still contains characters the regex engine sees as line separators.
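
To confirm which of these characters is actually present in the failing line, a small check like the one below may help. It is only a sketch: the class name, the method name and the sample line are made up for illustration, and the list of terminators is simply the five characters found by the test above.

```java
import java.util.ArrayList;
import java.util.List;

class LineTerminatorFinder {
    // The five characters that "." does not match by default (LF, CR, NEL, LS, PS).
    private static final String TERMINATORS = "\n\r\u0085\u2028\u2029";

    // Returns one "index:code" entry for every line terminator found in the text.
    public static List<String> findTerminators(String text) {
        List<String> found = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (TERMINATORS.indexOf(c) >= 0) {
                found.add(String.format("%d:\\u%04x", i, (int) c));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical line with a NEL character hidden after "ClaimNote".
        String line = "0|1|ClaimNote\u0085|JAR007|";
        System.out.println(line.matches(".*"));     // false
        System.out.println(findTerminators(line));  // one entry: index 13, NEL
    }
}
```

Running findTerminators on the line that refuses to match should show where the offending character sits, which usually makes it clear whether to clean the data or change the regex flags.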

Possible solutions

We can let . match any character, including line separators. To do so, use the Pattern.DOTALL flag (it can also be enabled inside the regex by adding (?s), as in (?s).*).

Also, as you already mentioned in your question, we can put the regex engine in Pattern.UNIX_LINES mode (the (?d) flag), which makes it treat only \n as a line separator (other characters such as \r are then no longer treated as line separators, so . matches them).
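
Both options can be tried side by side with a sketch like this. The sample line is hypothetical (I assume a NEL character in the middle of the text field, which may not be what your file contains), and the class and method names are just for illustration.

```java
import java.util.regex.Pattern;

class DotFlagsDemo {
    // Hypothetical record: same shape as the line in the question, with NEL embedded.
    static final String LINE = "00241386002|5296060|0|1|Claim\u0085Note|JAR007|";
    static final String REGEX = "^\\d+\\|\\d+\\|\\d+\\|\\d+.+$";

    // Compiles REGEX with the given flags and tests it against the whole LINE.
    public static boolean matchesWith(int flags) {
        return Pattern.compile(REGEX, flags).matcher(LINE).matches();
    }

    public static void main(String[] args) {
        System.out.println(LINE.matches(REGEX));              // false: "." stops at NEL
        System.out.println(matchesWith(Pattern.DOTALL));      // true: "." matches everything
        System.out.println(matchesWith(Pattern.UNIX_LINES));  // true: only LF is a terminator
        System.out.println(LINE.matches("(?s)" + REGEX));     // true: inline form of DOTALL
        System.out.println(LINE.matches("(?d)" + REGEX));     // true: inline form of UNIX_LINES
    }
}
```

Note that UNIX_LINES would still fail on a line containing \n, while DOTALL accepts that as well, so the choice depends on which characters you expect in the data.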

Upvotes: 5
