ccoutinho

Reputation: 4544

Splitting strings in Java: lookahead and lookbehind with variable length

I want to break a String in Java using numbers as delimiters, but keep the numbers. A bit of research has shown me that the split() method of String would be appropriate, but I failed to understand how to use it for this. To explain my question further, I'll use an example:

Input: 20.55|50|0.5|20|20.55

Required Output: ["20.55","|","50","|","0.5","|","20","|","20.55"]

Calling split as in the example below, without lookahead or lookbehind, gives the output I expected (the numbers are consumed as delimiters):

expression.split("([0-9]+(\\.[0-9]+)?)")

Output: ["|","|","|","|"]

But if I try to do that with lookahead:

expression.split("(?=([0-9]+(\\.[0-9]+)?))")

Output: ["2","0.","5","5|","5","0|","0.","5|","2","0|","2","0.","5","5"]    

And by using lookbehind I get an exception:

Exception in thread "main" java.util.regex.PatternSyntaxException: Look-behind group does not have an obvious maximum length near index 22 (?<=([0-9]+(.[0-9]+)?))

Can anyone explain this behaviour to me and suggest a solution?

PS: I know I can use the '|' to break the string, but this is just a silly example, I actually need a much more complex regex...

EDIT:

It seems to be impossible to do what I want because of the variable length of the delimiters. I was looking for a solution to a smaller problem that I could then apply to the rest of the exercise, so I will rephrase the question to see if there's a workaround, like the ones found in the second and third answers:

I want to break a String in Java containing an arithmetic expression, and keep all its items. For example:

Input: 20.55 * 0.5 ** cos(360) + sin 0 * cos 90 + 1 * sin (180 + 90) * 0
Output: ["20.55", "*", "0.5", "**", "cos", "(", "360", ")", "+", "sin", "0", "*", "cos", "90", "+", "1", "*", "sin", "(", "180", "+", "90", ")", "*", "0"] 

PPS: please note that I have to use '**' for exponentiation.

EDIT 2: Following the answer given by anubhava, I found a solution that breaks an arithmetic expression into all its items:

Pattern p = Pattern.compile( "\\*\\*|sin|cos|tan|\\d+(?:\\.\\d+)?|[-()+*/%]" );
Matcher matcher = p.matcher(expression);

while(matcher.find())
    System.out.println(matcher.group());

Upvotes: 3

Views: 1183

Answers (3)

anubhava

Reputation: 784958

You can use this lookaround-based regex for splitting:

String[] toks = "20.55|50|0.5|20|20.55".split( "(?=[^\\d.])|(?<=[^\\d.])" );

for (String tok: toks)
    System.out.printf("%s%n", tok);

Update:

You can use this regex for matching your tokens:

Pattern p = Pattern.compile( "sin|cos|tan|\\d+(?:\\.\\d+)?|[-()+*/%]" );

You can then call the Matcher#find() method in a while loop to get all the matched tokens.
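As a rough, self-contained sketch of that loop: the class name ExpressionTokenizer and the method tokenize are hypothetical, and the pattern is the variant from the question's EDIT 2, which puts "\\*\\*" first so that "**" is returned as a single token. Characters that match no alternative (such as spaces) are simply skipped here, which may or may not be what you want.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
final class ExpressionTokenizer {
    // "\\*\\*" is listed before the single-character class so "**" wins over "*".
    private static final Pattern TOKEN = Pattern.compile(
            "\\*\\*|sin|cos|tan|\\d+(?:\\.\\d+)?|[-()+*/%]");

    // Collects every match in order; unmatched characters (e.g. spaces) are skipped.
    static List<String> tokenize(String expression) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(expression);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [20.55, *, 0.5, **, cos, (, 360, ), +, sin, 0]
        System.out.println(tokenize("20.55 * 0.5 ** cos(360) + sin 0"));
    }
}
```

Because the tokens are matched rather than split on, no lookbehind is needed and the variable length of the numbers is not a problem.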

Upvotes: 2

m.cekiera

Reputation: 5395

Try with:

(?<=\d)(?=\|)|(?<=\|)(?=\d)

In Java:

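import java.util.Arrays;

public class RegexTest{
    public static void main(String[] args){
        String string = "20.55|50|0.5|20|20.55";
        // Split between a digit and "|" and between "|" and a digit; nothing is consumed
        System.out.println(Arrays.toString(string.split("(?<=\\d)(?=\\|)|(?<=\\|)(?=\\d)")));
    }
}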

with result:

[20.55, |, 50, |, 0.5, |, 20, |, 20.55]

EDIT

To use other characters as delimiters, to include "*", "sin", etc., you can change the regex to:

(?<=[0-9a-z*])(?=\|)|(?<=\|)(?=[0-9a-z*])

where [0-9a-z*] means a digit, a lowercase letter or "*". If you want to include other characters, just add them to the character class, like [0-9a-z*E], etc.

Upvotes: 1

ndnenkov

Reputation: 36101

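The problem is that your regex engine rejects a lookbehind whose maximum length it cannot determine, which is what the exception message says. + and * can match an unbounded number of characters (a bounded quantifier such as ? on its own is accepted by Java). Many regex engines restrict lookbehinds in a similar way.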

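You can have lookaheads with variable length, however. But in your case this won't do the job, because lookarounds are zero-width: they don't consume what they match, so the lookahead succeeds at every position that is followed by a digit, including positions inside a number.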
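A small sketch of that effect, assuming Java 8 or later (where a zero-width match at the start of the input does not produce a leading empty string). The class and method names are hypothetical; the pattern is the lookahead from the question.

```java
import java.util.Arrays;

// Hypothetical demo class, not part of the original answer.
final class LookaheadSplitDemo {
    // The lookahead is zero-width, so it matches before EVERY digit,
    // not just before the first digit of each number.
    static String[] splitBeforeEveryDigit(String s) {
        return s.split("(?=([0-9]+(\\.[0-9]+)?))");
    }

    public static void main(String[] args) {
        // prints [2, 0., 5, 5|, 5, 0]
        System.out.println(Arrays.toString(splitBeforeEveryDigit("20.55|50")));
    }
}
```

Each digit starts a substring that satisfies the lookahead on its own, which is why the numbers are cut apart.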

You want something that does:

([0-9]+(\\.[0-9]+)?)\\K

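What \K does is discard everything matched so far from the reported match. Therefore you still split at specific positions, and the floating-point numbers are not split internally. Note that \K is a Perl/PCRE feature; java.util.regex does not support it, so this pattern will not compile in Java.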
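For Java, where \K is not available, one possible substitute is to split on the boundary between "number characters" and everything else, using only fixed-width lookarounds. This is a sketch under assumptions: the class name NumberBoundarySplit is hypothetical, numbers are assumed to consist only of digits and ".", and consecutive non-number characters (for example "||") would stay together in one piece.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the original answer.
final class NumberBoundarySplit {
    // Zero-width boundary: a digit or "." on one side, anything else on the other.
    // Both lookbehinds are exactly one character wide, so Java accepts them.
    private static final String BOUNDARY =
            "(?<=[\\d.])(?=[^\\d.])|(?<=[^\\d.])(?=[\\d.])";

    static String[] split(String s) {
        return s.split(BOUNDARY);
    }

    public static void main(String[] args) {
        // prints [20.55, |, 50, |, 0.5, |, 20, |, 20.55]
        System.out.println(Arrays.toString(split("20.55|50|0.5|20|20.55")));
    }
}
```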

Upvotes: 1
