Reputation: 1
I am trying to parse strings (some can be very long, whole paragraphs) based on whitespace (spaces, line breaks, tabs). We are currently using String.split("\\s++"). In the previous project, which we are updating, we had simply used StringTokenizer. Using String.split("\\s++") works just fine in all our testing and with all our beta testers.
As soon as we released it to a wider group of users, it ran for a while and then soaked up all server resources. From what I've researched, it appears to be catastrophic backtracking. We get errors like:
....was in progress with [email protected]/java.util.regex.Pattern$GroupHead.match(Pattern.java:4804)
[email protected]/java.util.regex.Pattern$Start.match(Pattern.java:3619)
[email protected]/java.util.regex.Matcher.search(Matcher.java:1729)
[email protected]/java.util.regex.Matcher.find(Matcher.java:746)
[email protected]/java.util.regex.Pattern.split(Pattern.java:1264)
[email protected]/java.lang.String.split(String.java:2317)
Users can type some crazy text. What is the best option to parse strings that could be anywhere from 10 to 1000 characters long? I have hit a brick wall. I have been trying different patterns (regex is not my strongest area) for the past 4 days without long-term success.
Upvotes: 0
Views: 112
Reputation: 5289
The simple solution, if you don't trust the regex, is to use a non-regex-based approach such as Apache Commons Lang's StringUtils#split. Alternatively, it's pretty easy to write one yourself.
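For the "write one yourself" route, a minimal sketch might look like the following. The class and method names (WhitespaceSplitter, split) are made up for illustration. It walks the string once and cuts on anything Character.isWhitespace reports as whitespace, so there is no regex engine involved and the cost is linear in the input length.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library.
final class WhitespaceSplitter {

    // Splits on runs of whitespace in a single pass; never produces empty tokens.
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        int start = -1; // index where the current token began, or -1 while between tokens
        for (int i = 0; i < text.length(); i++) {
            if (Character.isWhitespace(text.charAt(i))) {
                if (start >= 0) {
                    // Whitespace ends the token that was in progress.
                    tokens.add(text.substring(start, i));
                    start = -1;
                }
            } else if (start < 0) {
                // First non-whitespace character after a gap starts a new token.
                start = i;
            }
        }
        if (start >= 0) {
            // The input ended in the middle of a token.
            tokens.add(text.substring(start));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("  foo\tbar\r\nbaz  qux ")); // prints [foo, bar, baz, qux]
    }
}
```

One behavioral difference to be aware of: with leading whitespace, String.split returns an empty first element, while this sketch (like StringTokenizer) skips it. If you go with Commons Lang instead, StringUtils.split(str) with no separator argument also splits on whitespace.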
Keep in mind that the difference between StringTokenizer and a split function is that the tokenizer is lazy: it produces tokens one at a time as you ask for them, whereas split builds the whole result array up front. If you only retrieve a subset of the split results, you may be using more memory with a split. I would only expect this to be a problem with large strings, though.
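To illustrate the laziness point, here is a small sketch that pulls only the first few tokens out of a string with StringTokenizer and stops scanning once it has them. The class and method names (LazyTokens, firstTokens) and the limit parameter are assumptions for the example, not something from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper, not part of any library.
final class LazyTokens {

    // Returns at most `limit` whitespace-separated tokens from the start of `text`.
    static List<String> firstTokens(String text, int limit) {
        // The one-argument constructor uses the default delimiters:
        // space, tab, newline, carriage return and form feed.
        StringTokenizer tokenizer = new StringTokenizer(text);
        List<String> tokens = new ArrayList<>();
        // Tokens are produced on demand, so the rest of the string is
        // never tokenized once the limit is reached.
        while (tokens.size() < limit && tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(firstTokens("alpha beta\tgamma\ndelta", 2)); // prints [alpha, beta]
    }
}
```

For inputs in the 10 to 1000 character range the memory difference is unlikely to matter, so this mostly becomes relevant when the strings get much larger or you routinely need only the first few words.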
Upvotes: 1