Reputation: 2320
I just created an regular expression in Java, I want to look for expressions in about 5000 tweets, each tweet takes almost one second, why is it so slow??
If it's too complex that expression or there're something on it that it's too expensive to execute? I'd hope to process the whole data in less than 5 seconds for sure.
The code is:
public class RegularExpression {
public static void main(String[] args) throws IOException {
String filter = ".*\"created_at\":\"(.*?)\".*\"content\":\"(.*?word.*?)\",\"id\".*";
Pattern pattern = Pattern.compile(filter);
List<String> tweets = FileUtils.readLines(new File("/tmp/tweets"));
System.out.println("Start with " + tweets.size() );
int i=0;
for (String t : tweets){
Matcher matcher = pattern.matcher(t);
matcher.find();
System.out.println(i++);
}
System.out.println("End");
}
}
The input are JSON tweets. If I do my RE simpler it runs faster, but, I think that my RE isn't so heavy. I'd like to understand why this's happenng, I was just checking a test.
UPDATED:
The reason why I'm using RE when I try to parse JSON, it's because in the end, I could get a simple text, and XML, a JSON format, a log from any kind of server. So, I have to work with my input like plain-text.
Upvotes: 1
Views: 179
Reputation: 5268
I'm not entirely sure why it takes almost a full second to process a single tweet, but lazy quantifiers are more expensive than a "match anything except" approach to a "match until"-scenario.
More information here: http://blog.stevenlevithan.com/archives/greedy-lazy-performance
You could try avoiding the use of lazy quantifiers, or just use a JSON parser instead, as it would likely be faster/cleaner.
Upvotes: 1
Reputation: 41686
Since you already know that your input is JSON, you should not use regular expressions to interpret it. Use a JSON parser, then you don't have to care about anything like escaping special characters.
Upvotes: 2
Reputation: 336498
Your regex is very imprecise in what it allows to match. Most importantly, you seem to be wanting to match text between quotes, but you're allowing quote characters to be part of the match (.*
can and will happily match "
!). This sets you up for a potentially very high number of permutations a regex engine has to check before declaring failure/success, depending on your input.
If in fact quotes may not be part of the text that you're currently matching with .*
, then use [^"]*
instead; that should speed it up a lot:
"[^\"]*\"created_at\":\"([^\"]*)\"[^\"]*\"content\":\"([^\"]*word[^\"]*)\",\"id\"[^\"]*"
Upvotes: 2