Reputation: 26493
I have 40KB HTML page and I want to find certain patterns in it.
I can read it by 1K buffer but I want to avoid situation that pattern that I'm searching would be split between two buffer reads.
How to overcome this problem?
Upvotes: 1
Views: 821
Reputation: 510
Why not use a SAX parser. It is build to handle large files of mark-up. You would ony run into problems if you are trying to match across different elements along the same level. However this is not impossible to handle
Upvotes: 0
Reputation: 5666
This is easy. You count the longest pattern you will look for, then either backtrack the file pointer by that amount, or you scroll through the file, reading only the delta.
Imagine the longest pattern being 26 bytes.
Edit: Let me clarify: There are two methods to do this, both have their merits. The one I documented above is best used if you are reading from a stream, which means a data source that does not support seeking. If, however, your datasource does support seeking (like a filesystem file), you can easily do the same with seeks. Check for pattern, if not found, seek back the size of your longest pattern, then start from there.
If, however, you want to support the search for patterns that are longer than your buffer size, you might need a much more clever algorithm. You would need a lookup table of all patterns that are currently "open" when you contnue to read more data, which in turn will cost more memory - you get the problem.
Upvotes: 3
Reputation: 61434
You could take a look at CharBuffer
, which implements CharSequence
for just this purpose
Upvotes: 1