Reputation: 1091
So here's a problem and question:
I'm analyzing a document page on HTML5 Canvas and detecting certain features, such as boxes, labels, text blocks, images, tables, etc. Because Canvas is slow for pixel read/write and the image needs to be high-res for good accuracy (e.g., 1500 x 2500), I cannot afford to analyze every pixel, let alone in multiple passes.
My algorithm does some random pixel pokes and does some minimal analysis to find if there is a usable bounding box for further processing and the type of processing that needs to be done; some parts may be sent to the server, like OCR.
Every subsequent random poke checks against a growing list of successfully found bounding boxes and pokes elsewhere until it gets into uncharted waters. The technique is surprisingly simple and effective, but this results in a lot of extra random pokes and does not provide consistent results without large poke counts (1% of area), and even then it misses some parts intermittently.
What would be great is to implement some spatial analysis algorithm that can tell me where the unpoked areas are outside of all bounding boxes, so that I can restrict my x/y random coordinate selection to there only. It should increase the efficacy and speed by a significant amount.
My typical box count for a fully analyzed doc page is < 200.
Does any algorithm exist in the public domain/wiki that can do this in JavaScript reasonably fast?
Upvotes: 0
Views: 242
Reputation: 1871
Some thoughts that I hope might help. This is a broad idea; it still needs some work!
The assumptions are that no bounding boxes overlap and that they are found one at a time.
The following would be turned into a recursive procedure 'Check' on documents, which halts when documents are too small to continue.
Check(document) {
    if (document is the root document) {
        Find a bounding box in document
        Split document horizontally into 4 new documents
        for each new document: Check(new document)
    }
    else {
        Find a bounding box in document
        if (bounding box is wholly inside document) {
            Split document horizontally into 4 new documents
            for each new document: Check(new document)
        }
        else {
            Split parent document into 4 vertically
            Using the information found about the position of the bounding box,
            Check(appropriate vertical document)
            for each of the other vertical documents: Check(document)
        }
    }
}
The following is a pdf file to help illustrate the idea.
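As a rough JavaScript sketch of a related way to track the unexplored area (not the exact recursive procedure above): keep a list of free rectangles, and whenever a bounding box is found, cut it out of every free rectangle it touches, leaving up to four smaller pieces. Random pokes are then drawn only from the free rectangles, weighted by area. All names here (`subtractBox`, `FreeSpace`, `addBox`, `randomPoint`) are made up for illustration, and rectangles are assumed to be axis-aligned `{x, y, w, h}` objects in pixels.

```javascript
// Sketch only: every name in this block is hypothetical.
// Rectangles are axis-aligned {x, y, w, h} objects in pixels.

// Return the parts of `free` that are not covered by `box` (0 to 4 rectangles).
function subtractBox(free, box) {
  const fx2 = free.x + free.w, fy2 = free.y + free.h;
  const bx2 = box.x + box.w, by2 = box.y + box.h;
  // No overlap: the free rectangle survives unchanged.
  if (box.x >= fx2 || bx2 <= free.x || box.y >= fy2 || by2 <= free.y) {
    return [free];
  }
  // Clip the box to the free rectangle.
  const ix1 = Math.max(free.x, box.x), iy1 = Math.max(free.y, box.y);
  const ix2 = Math.min(fx2, bx2), iy2 = Math.min(fy2, by2);
  const out = [];
  // Full-width strips above and below the box.
  if (iy1 > free.y) out.push({ x: free.x, y: free.y, w: free.w, h: iy1 - free.y });
  if (iy2 < fy2) out.push({ x: free.x, y: iy2, w: free.w, h: fy2 - iy2 });
  // Left and right strips, limited to the box's vertical span.
  if (ix1 > free.x) out.push({ x: free.x, y: iy1, w: ix1 - free.x, h: iy2 - iy1 });
  if (ix2 < fx2) out.push({ x: ix2, y: iy1, w: fx2 - ix2, h: iy2 - iy1 });
  return out;
}

class FreeSpace {
  constructor(width, height) {
    // Initially the whole page is unexplored.
    this.rects = [{ x: 0, y: 0, w: width, h: height }];
  }

  // Remove a newly found bounding box from the unexplored area.
  addBox(box) {
    this.rects = this.rects.flatMap(r => subtractBox(r, box));
  }

  // Total unexplored area in pixels.
  area() {
    return this.rects.reduce((sum, r) => sum + r.w * r.h, 0);
  }

  // Pick a random integer pixel in the unexplored area, weighted by
  // rectangle area. Returns null when nothing is left.
  randomPoint(rand = Math.random) {
    const total = this.area();
    if (total === 0) return null;
    let t = rand() * total;
    let chosen = this.rects[this.rects.length - 1]; // fallback for rounding error
    for (const r of this.rects) {
      const a = r.w * r.h;
      if (t < a) { chosen = r; break; }
      t -= a;
    }
    return {
      x: chosen.x + Math.floor(rand() * chosen.w),
      y: chosen.y + Math.floor(rand() * chosen.h),
    };
  }
}
```

Usage would look something like `const space = new FreeSpace(1500, 2500)`, then `space.addBox(box)` after each detection and `space.randomPoint()` for the next poke. With fewer than 200 boxes the free list should stay small enough for a linear scan, though it does fragment over time, so merging adjacent pieces or switching to a quadtree may be worth trying if it grows.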
Upvotes: 1