WiWeber

Reputation: 241

Regexp: extract text between two tags

I am trying to parse text like this

0:ID IN (1002,25);1:ID IN (2,3,4) AND COQ>=0 AND COQ<=9;2:ID IN
(73150,73150) AND TOTAL>=0 AND TOTAL<=99999

in Java. I need the text between each number-plus-colon prefix and the following semicolon or end of line:

0:<--this-->;1:<--this-->;2:<--this-->
  • ID IN (1002,25)
  • ID IN (2,3,4) AND COQ>=0 AND COQ<=9
  • ID IN (73150,73150) AND TOTAL>=0 AND TOTAL<=99999

An additional problem is that there may be whitespace between the numbers and the colons:

 0    :ID IN (1002,25);   1   :ID IN (2,3,4) AND COQ>=0 AND COQ<=9;2:ID IN
(73150,73150) AND TOTAL>=0 AND TOTAL<=99999

I tried (?<=[\d]:).*(?=;|$) and (?<=:).*(?=;|$)

but neither pattern solves the problem, because the greedy .* ignores every digit-plus-colon prefix between the first and the last occurrence.

They also would not ignore digits with colons inside quoted strings (a further problem, but a minor one in my case):

 0    :NAME = '3:;' OR NAME = "0 :  ;" ;   1   :CO>=0;2:TOTAL<=99999
  • NAME = '3:;' OR NAME = "0 :  ;"
  • CO>=0
  • TOTAL<=99999

It would be very cool if you had some good advice for solving this tricky problem. Merci merci, pavoo

Upvotes: 3

Views: 83

Answers (5)

Wiktor Stribiżew

Reputation: 627468

You can use a more complex regex like this to consider all your corner cases:

(?:^\s*(?!\s*\d+\s*:)|\d+\s*:)((?:'[^']*'|"[^"]*"|[^;])+)

See demo

The (?:^\s*(?!\s*\d+\s*:)|\d+\s*:) subpattern matches the start of a segment: either the beginning of the string with optional whitespace, provided no index prefix (optional whitespace, digits, optional whitespace, colon) follows, or digits followed by optional whitespace and a colon. The capturing group then matches one or more of: a string inside '...', a string inside "...", or any single character other than ;.

IDEONE demo:

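import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "0:ID IN (1002,25);1:ID IN (2,3,4) AND COQ>=0 AND COQ<=9;2:ID IN (73150,73150) AND TOTAL>=0 AND TOTAL<=99999";
// Index prefix (or prefix-less start of input), then quoted strings or non-';' characters
Pattern pattern = Pattern.compile("(?:^\\s*(?!\\s*\\d+\\s*:)|\\d+\\s*:)((?:'[^']*'|\"[^\"]*\"|[^;])+)");
Matcher matcher = pattern.matcher(s);
System.out.println("Match 1:\n");
while (matcher.find()) {
    System.out.println(matcher.group(1)); // group 1 is the segment without its prefix
}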
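To check the quoted-string case from the question, the same pattern can be wrapped in a small helper. This is only a sketch: the class name SegmentExtractor and the method extract are made up for illustration, and the trim() call is an addition that strips the whitespace left before each semicolon.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SegmentExtractor {
    // Same pattern as above: an index prefix ("12 :") or a prefix-less start
    // of input, followed by quoted strings or non-semicolon characters.
    private static final Pattern SEGMENT = Pattern.compile(
        "(?:^\\s*(?!\\s*\\d+\\s*:)|\\d+\\s*:)((?:'[^']*'|\"[^\"]*\"|[^;])+)");

    // Hypothetical helper: returns every segment without its index prefix.
    static List<String> extract(String input) {
        List<String> parts = new ArrayList<>();
        Matcher m = SEGMENT.matcher(input);
        while (m.find()) {
            parts.add(m.group(1).trim()); // trim() is an addition, not part of the pattern
        }
        return parts;
    }

    public static void main(String[] args) {
        String s = " 0    :NAME = '3:;' OR NAME = \"0 :  ;\" ;   1   :CO>=0;2:TOTAL<=99999";
        extract(s).forEach(System.out::println);
    }
}
```

The quoted alternatives are listed before [^;], so a semicolon or an index-like 3: inside quotes is consumed as part of the string instead of ending the segment.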

Upvotes: 1

sp00m

Reputation: 48837

Playing with lookarounds, this one should suit your needs:

(?<=:).*?(?=;|$)

Demo on regex101

Don't forget to enable the dotall mode.


In Java:

Pattern pattern = Pattern.compile("(?<=:).*?(?=;|$)", Pattern.DOTALL);
Matcher matcher = pattern.matcher(yourInputString);
while (matcher.find()) {
    System.out.println(matcher.group());
}
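The only change from the second attempt in the question is the reluctant .*? in place of the greedy .*. A rough sketch of the difference, assuming the first sample input from the question (the class GreedyVsReluctant and its findAll helper are hypothetical names):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctant {
    // Hypothetical helper: collects every match of regex in input (DOTALL enabled).
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String s = "0:ID IN (1002,25);1:ID IN (2,3,4) AND COQ>=0 AND COQ<=9;"
                 + "2:ID IN (73150,73150) AND TOTAL>=0 AND TOTAL<=99999";
        // Greedy: .* runs to the end of the input, so there is a single match.
        System.out.println(findAll("(?<=:).*(?=;|$)", s).size());
        // Reluctant: .*? stops at the first ';' (or the end), one match per segment.
        findAll("(?<=:).*?(?=;|$)", s).forEach(System.out::println);
    }
}
```

Note that this pattern treats every colon as a start and every semicolon as an end, so it does not cover the quoted-string case from the question.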

Upvotes: 2

SomeJavaGuy

Reputation: 7357

I came up with this solution, but I still need to strip the 0: prefix from the first array element.

String s = "0    :NAME = '3:;' OR NAME = \"0 :  ;\" ;   1   :CO>=0;2:TOTAL<=99999";

// Split on ";<index>:" (using \\d+ so multi-digit indices work too),
// then strip the leading "<index>:" that remains on the first element.
for (String s2 : s.split(";\\s*\\d+\\s*:")) {
    System.out.println(s2.replaceAll("^(\\s*\\d+\\s*:)", ""));
}

s = "0:ID IN (1002,25);1:ID IN (2,3,4) AND COQ>=0 AND COQ<=9;2:ID IN (73150,73150) AND TOTAL>=0 AND TOTAL<=99999";

for (String s2 : s.split(";\\s*\\d+\\s*:")) {
    System.out.println(s2.replaceAll("^(\\s*\\d+\\s*:)", ""));
}

From what I can see, this gives the correct results.

Upvotes: 1

Casimir et Hippolyte

Reputation: 89639

You can split your string in this way:

String[] parts = s.split(";?\\s*\\d+\\s*:");
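One thing to watch: because the input starts with a delimiter (0:), the array returned by split begins with an empty string. A minimal sketch that drops blank pieces and trims the rest (the class SplitSegments and its split method are illustrative names, not part of any library):

```java
import java.util.ArrayList;
import java.util.List;

class SplitSegments {
    // Hypothetical helper: splits on "<optional ;><index>:" and discards blank pieces.
    static List<String> split(String s) {
        List<String> out = new ArrayList<>();
        for (String part : s.split(";?\\s*\\d+\\s*:")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) { // skips the empty element before the first "0:"
                out.add(trimmed);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String s = " 0    :ID IN (1002,25);   1   :ID IN (2,3,4) AND COQ>=0 AND COQ<=9;"
                 + "2:ID IN (73150,73150) AND TOTAL>=0 AND TOTAL<=99999";
        split(s).forEach(System.out::println);
    }
}
```

This split has no notion of quotes, so an index-like 3: inside a quoted string would also be treated as a delimiter.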

Upvotes: 1

Maroun

Reputation: 96018

Try the following regex:

\d+\s*:([^;]+)

The captured group is what you want.
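In Java that could look roughly like the following sketch (the class name CapturePrefix and the method segments are made up here; note the doubled backslashes in the string literal):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CapturePrefix {
    // Digits, optional whitespace, a colon, then everything up to the next ';'.
    private static final Pattern P = Pattern.compile("\\d+\\s*:([^;]+)");

    // Hypothetical helper: returns group 1 of every match.
    static List<String> segments(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = P.matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "0:ID IN (1002,25);1:ID IN (2,3,4) AND COQ>=0 AND COQ<=9;"
                 + "2:ID IN (73150,73150) AND TOTAL>=0 AND TOTAL<=99999";
        segments(s).forEach(System.out::println);
    }
}
```

Since [^;]+ stops at the first semicolon, this handles the whitespace variant but not semicolons inside quoted strings.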

Upvotes: 1
