pixel
pixel

Reputation: 26493

Find patterns in long string?

I have 40KB HTML page and I want to find certain patterns in it.

I can read it by 1K buffer but I want to avoid situation that pattern that I'm searching would be split between two buffer reads.

How to overcome this problem?

Upvotes: 1

Views: 821

Answers (4)

technoSpino
technoSpino

Reputation: 510

Why not use a SAX parser. It is build to handle large files of mark-up. You would ony run into problems if you are trying to match across different elements along the same level. However this is not impossible to handle

Upvotes: 0

adrianboimvaser
adrianboimvaser

Reputation: 2633

That's what the Scanner class is for.

Upvotes: 2

0xCAFEBABE
0xCAFEBABE

Reputation: 5666

This is easy. You count the longest pattern you will look for, then either backtrack the file pointer by that amount, or you scroll through the file, reading only the delta.

Imagine the longest pattern being 26 bytes.

  1. Read 1k.
  2. Check for all patterns -> nothing.
  3. Drop 1k - 26 bytes from the buffer.
  4. Read 1k - 26 bytes from stream and add to your buffer
  5. Goto 2.

Edit: Let me clarify: There are two methods to do this, both have their merits. The one I documented above is best used if you are reading from a stream, which means a data source that does not support seeking. If, however, your datasource does support seeking (like a filesystem file), you can easily do the same with seeks. Check for pattern, if not found, seek back the size of your longest pattern, then start from there.

If, however, you want to support the search for patterns that are longer than your buffer size, you might need a much more clever algorithm. You would need a lookup table of all patterns that are currently "open" when you contnue to read more data, which in turn will cost more memory - you get the problem.

Upvotes: 3

sblundy
sblundy

Reputation: 61434

You could take a look at CharBuffer, which implements CharSequence for just this purpose

Upvotes: 1

Related Questions