Reputation: 609
In my project, I need to download an HTML page (about 50K-100K characters when read into a String, so yes, quite fat), extract some content using regular expressions, and then insert it into the database. The performance is quite bad, and I want to know why.
The code works roughly like this (multithreaded):
Pattern p = Pattern.compile("<h.*</a></h.>",Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(html);
boolean result = m.find();
while (result) {
    // insert into database stuff
    // update database stuff
    result = m.find(); // advance to the next match; without this the loop never terminates
}
The string is very long, but if I split it into pieces, some matches may be missed. This is quite disturbing.
I added some print statements and found that after the insert into the database there is a delay before the update operations run, but I can't figure out why, as the connection to the database isn't closed.
Upvotes: 1
Views: 167
Reputation: 1757
Use a profiler, such as VisualVM. It will show you exactly what method is taking up time.
In your case, it's a pretty safe bet that your approach of using a regex is not ideal.
Edit: I disagree it's too early for a profiler. You can monitor your threads, and see if they're waiting for locks. Also, the profiler will show memory statistics and CPU utilization - so you'll know that it is the application. A profiler is the perfect tool to use.
Upvotes: 1
Reputation:
Stop right there.
You are committing one of the worst sins that it's possible to commit when performance tuning.
You are assuming that the performance problem is where you think it is in the code.
You do not know that, and until you have hard evidence, you could be optimizing the wrong thing - and may well be making the situation worse.
First of all, you need to confirm that the problem is application code. As this is a multithreaded application, which is downloading data (over a network) and inserting into a database (over a network), then you first need to rule out issues to do with thread monitors / locks and network / IO issues.
It's too early to even use a profiler. If you profile now, you could be missing things.
1) If you don't have the GC switches on, put them on now. Production Java applications should never run without GC logging.
2) Rerun your test case, with vmstat 1 running (if it's Unix) or Task Manager (if it's Windows).
3) Update your question with details of whether the CPU utilisation goes to 100% during the test run, and we can do the next step.
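While you gather those numbers, one cheap extra data point is to compare wall-clock time with CPU time for a single phase of a worker thread. This is only a rough sketch: the class name PhaseTimer and the method measure are made up for illustration, and it assumes your JVM supports per-thread CPU timing. If wall time is far larger than CPU time, the thread spent most of the phase waiting (network, database, locks) rather than computing.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not part of the original code.
class PhaseTimer {

    /** Returns {wallNanos, cpuNanos} for one run of the task on the calling thread. */
    static long[] measure(Runnable task) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        boolean cpuSupported = threads.isCurrentThreadCpuTimeSupported();

        long wallStart = System.nanoTime();
        long cpuStart = cpuSupported ? threads.getCurrentThreadCpuTime() : 0L;

        task.run();

        long cpu = cpuSupported ? threads.getCurrentThreadCpuTime() - cpuStart : 0L;
        long wall = System.nanoTime() - wallStart;
        return new long[] { wall, cpu };
    }

    public static void main(String[] args) {
        // Stand-in for one phase, e.g. the regex loop or the database calls.
        long[] t = measure(() -> {
            try {
                Thread.sleep(200);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        System.out.println("wall ms: " + t[0] / 1_000_000 + ", cpu ms: " + t[1] / 1_000_000);
    }
}
```

Wrap the regex loop and the database calls separately; whichever phase shows a large gap between the two figures is waiting on something, not burning CPU.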
Upvotes: 2
Reputation: 55866
Try to avoid regex and use a standard HTML parser such as jsoup; there are many. They should be more efficient, or at least more efficient than regex, I would hope.
If you do use regex, try not to compile the regex each time. You can keep the Pattern in a private static field. This isn't a huge performance gain, just good practice.
Use connection pooling for the database. If possible, do batch inserts.
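A minimal sketch of the static Pattern idea, using the pattern from the question unchanged. The class name HeadingExtractor and the method extract are made up for illustration. Pattern is immutable and can be shared between threads; Matcher is not, so each call creates its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original code.
class HeadingExtractor {

    // Compiled once when the class is loaded and shared by all threads.
    private static final Pattern HEADING =
            Pattern.compile("<h.*</a></h.>", Pattern.CASE_INSENSITIVE);

    static List<String> extract(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = HEADING.matcher(html); // one Matcher per call; Matcher is not thread-safe
        while (m.find()) {                 // find() advances to the next match on each call
            found.add(m.group());
        }
        return found;
    }
}
```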
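For the batch insert part, something along these lines should work with plain JDBC. Treat it as a sketch: the table headings, the column content, and the class HeadingBatchWriter are invented for the example, and the Connection is assumed to come from whatever pool you use.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

// Hypothetical helper, not part of the original code.
class HeadingBatchWriter {

    // Table and column names are placeholders.
    private static final String SQL = "INSERT INTO headings (content) VALUES (?)";

    static void insertAll(Connection conn, List<String> headings) throws SQLException {
        conn.setAutoCommit(false); // one transaction for the whole batch
        try (PreparedStatement ps = conn.prepareStatement(SQL)) {
            for (String heading : headings) {
                ps.setString(1, heading);
                ps.addBatch();     // queue the row instead of sending it immediately
            }
            ps.executeBatch();     // send all queued rows together
            conn.commit();
        } catch (SQLException e) {
            conn.rollback();       // undo the partial batch before rethrowing
            throw e;
        }
    }
}
```

Collect the matches for one page first, then call insertAll once, rather than hitting the database inside the find() loop.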
Upvotes: 2
Reputation: 5253
Using regular expressions to pull values from HTML is always a mistake. HTML syntax is a lot more complex than it may first appear, and it's very easy for a page to catch out even a very complex regular expression.
Pattern matching is also a tedious way to parse HTML, because the regex engine breaks a long string into groups and sub-groups and then tries to match each of them against the pattern. Maybe that's why your performance is poor.
Use an HTML parser instead. See also: "What are the pros and cons of the leading Java HTML parsers?"
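To make that concrete with the pattern from the question: the .* is greedy, so when two headings sit on the same line (common in minified pages) it swallows everything from the first <h to the last </a></h.> and reports one match instead of two. The sample markup and the class name GreedyDemo below are made up for the demonstration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original code.
class GreedyDemo {

    /** Counts how many times the regex matches in the given text. */
    static int countMatches(String regex, String html) {
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(html);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Two headings on a single line, as in a minified page.
        String html = "<h2><a href=\"/a\">A</a></h2><h2><a href=\"/b\">B</a></h2>";

        System.out.println(countMatches("<h.*</a></h.>", html));  // greedy: prints 1
        System.out.println(countMatches("<h.*?</a></h.>", html)); // reluctant: prints 2
    }
}
```

Switching to the reluctant .*? is only a partial fix. The prefix <h. also matches tags such as <html, <head and <hr, so a match can still start at the wrong tag.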
Upvotes: 1