Reputation: 2485
I'm trying to develop an application that measures word density within an HTML page. I'm well skilled in Java, but I've never used Lucene. Do you think it is feasible to use Lucene for this purpose? Or will the markup elements contained in the HTML page cause inefficient searches? Any suggestion is welcome!
Thanks!
Upvotes: 0
Views: 65
Reputation: 26733
It would be wise to strip the HTML tags and index only the text content; otherwise tag and attribute names end up being counted as words. This has already been discussed on SO before. I recommend using JSoup (we're using it for another purpose but are quite happy with it; it's also mentioned in the referenced SO post), but YMMV.
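For what it's worth, here is a minimal sketch of the counting step once the markup is gone. It assumes you already have the plain text, for example from JSoup's `Jsoup.parse(html).text()`, and it uses only the JDK, so Lucene is not involved here. The class name `WordDensity`, the `density` method, and the tokenization rule (split on anything that is not a letter or digit) are my own assumptions for illustration, not something taken from Lucene or JSoup.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class WordDensity {

    // Returns each distinct word mapped to its share of all words (0.0 to 1.0).
    // The input is assumed to be plain text with the HTML tags already stripped.
    static Map<String, Double> density(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        int total = 0;

        // Hypothetical tokenization: lower-case, split on non-letter/non-digit runs.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // a leading separator yields one empty token
            }
            counts.merge(token, 1, Integer::sum);
            total++;
        }

        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            result.put(e.getKey(), e.getValue() / (double) total);
        }
        return result;
    }

    public static void main(String[] args) {
        // In a real application this string would come from the tag-stripping step.
        String text = "Lucene indexes text. Lucene is fast.";
        density(text).forEach((word, share) ->
                System.out.printf("%s: %.3f%n", word, share));
    }
}
```

If you later need stemming or stop-word removal, that is where a Lucene analyzer could replace the simple `split` call, but for plain density figures a map of counts may be all you need.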
Upvotes: 1