Javangelist

Reputation: 1

Java String.split() spiralling out of control

I am trying to split strings (some can be very long, whole paragraphs) on whitespace (spaces, newlines, tabs). We currently use String.split("\\s++"). The previous project, which we are updating, simply used StringTokenizer. String.split("\\s++") worked fine in all our testing and with all our beta testers.

As soon as we released it to a wider group of users, it would run for a while and then consume all server resources. From what I've researched, this looks like catastrophic backtracking. We get errors like:

    ....was in progress with [email protected]/java.util.regex.Pattern$GroupHead.match(Pattern.java:4804)
    [email protected]/java.util.regex.Pattern$Start.match(Pattern.java:3619)
    [email protected]/java.util.regex.Matcher.search(Matcher.java:1729)
    [email protected]/java.util.regex.Matcher.find(Matcher.java:746)
    [email protected]/java.util.regex.Pattern.split(Pattern.java:1264)
    [email protected]/java.lang.String.split(String.java:2317)

Users can type some crazy text. What is the best way to split strings that could be anywhere from 10 to 1000 characters long? I've hit a brick wall. I've been trying different patterns (regex is not my strongest area) for the past 4 days without long-term success.

Upvotes: 0

Views: 112

Answers (1)

Deadron

Reputation: 5289

The simple solution, if you don't trust the regex, is to use a non-regex-based solution such as Apache Commons Lang StringUtils#split. Alternatively, it's pretty easy to write one yourself.
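
As a rough illustration of the "write one yourself" option, here is a minimal sketch of a single-pass splitter. The class and method names (WhitespaceSplitter, split) are made up for this example. It assumes Character.isWhitespace is an acceptable definition of whitespace, which is not exactly the same character set as the regex \s. Unlike String.split, it never returns an empty leading token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of any library.
public class WhitespaceSplitter {

    // Splits on runs of whitespace in one left-to-right pass, with no regex involved.
    public static List<String> split(String input) {
        List<String> tokens = new ArrayList<>();
        int start = -1; // index where the current token began, or -1 when between tokens
        for (int i = 0; i < input.length(); i++) {
            if (Character.isWhitespace(input.charAt(i))) {
                if (start >= 0) {
                    // Whitespace ends the current token.
                    tokens.add(input.substring(start, i));
                    start = -1;
                }
            } else if (start < 0) {
                // First non-whitespace character starts a new token.
                start = i;
            }
        }
        if (start >= 0) {
            // Input ended in the middle of a token.
            tokens.add(input.substring(start));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("  hello\tworld\n\nfoo  ")); // prints [hello, world, foo]
    }
}
```

The work is linear in the length of the input, so a 1000-character string costs at most 1000 character checks regardless of what the user typed.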

Keep in mind that the difference between StringTokenizer and a split function is that the tokenizer is lazy. If you were only retrieving a subset of the split results, you may be using more memory with a split, because it builds every token up front. I would only expect this to be a problem with large strings, though.
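
To make the laziness point concrete, here is a small sketch that reads only the first few tokens and never tokenizes the rest of the string. The class and method names (LazyTokens, firstTokens) are hypothetical. It relies on StringTokenizer's default delimiters (space, tab, newline, carriage return, form feed).

```java
import java.util.StringTokenizer;

// Hypothetical helper for illustration; not part of any library.
public class LazyTokens {

    // Returns the first `limit` whitespace-delimited tokens, joined by single spaces.
    // Tokens after the limit are never extracted.
    public static String firstTokens(String input, int limit) {
        StringTokenizer tokenizer = new StringTokenizer(input); // default delimiters: " \t\n\r\f"
        StringBuilder result = new StringBuilder();
        int count = 0;
        while (count < limit && tokenizer.hasMoreTokens()) {
            if (count > 0) {
                result.append(' ');
            }
            result.append(tokenizer.nextToken());
            count++;
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(firstTokens("one two\tthree\nfour", 2)); // prints one two
    }
}
```

If you need every token anyway, this buys nothing over a split; it only helps when you stop early.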

Upvotes: 1
