maxfax
maxfax

Reputation: 4305

Delphi: efficient and fast Unicode text search

Is there a fast and efficient text search in a Unicode text/string? I need to search a part of a word too, not only a whole word.

SearchBuf?

Thanks!

Upvotes: 3

Views: 3338

Answers (3)

Warren  P
Warren P

Reputation: 68892

One kind of fast and efficient search algorithm for some cases, is one that (if the data being searched is not changing, and you just search again and again on the same buffer content) searches once and builds some kind of lookup table. One possible solution is BoyerMoore (see link in comment by Rudy) but as that one didn't work great for really high range unicode characters (say range $0000...$FFFF), it was still faster than repeated Pos or SearchBuf calls, but it consumed a LOT of memory when I tested it with truly high-range Unicode data-sets (Chinese text for instance). My opinion is there is no "slam dunk" solution.

Perhaps you could write a faster-than-Pos-but-not-so-much-memory-as-BoyerMoore solution like this:

  1. Build a table for all character points and store the first location that each character appears. I would use a sparse sorted array, for each search you have done, store each starting location for that initial character. That will spare you the brute force search through the big unicode string looking for the initial character matches.

  2. If the table contains NO entry for a particular character, you have to do a brute force search (just once) to build that data set up. If you search once by brute force and codepoint U+1234 does not appear, then the entry for U+1234 should be an empty array [] indicating that you don't have to search again, and can quickly fail out on searches where the initial character doesn't match.

  3. Once you have found a non-empty search initial-character list of positions like this [123,456,789], an initial codepoint match, you can continue by doing a character by character check at string[123...x] then string[456..x] and so on until you run out of elements in the array. Any mismatch causes a skip to the next element in the table of lookups.

Again there will be pathological cases where all this extra work results in code that is less fast than Pos is, and unless you can be sure you need to search for a lot of different substrings in exactly the same big string, then all this "optimization" stuff is a waste of time; forget it, even boyer moore string searches are slower, when you can't reuse the lookup data table multiple times at least.

If all you want is a Pos() function that is hand-optimized in Assembly to work as fast as is possible within the confines of a library function that doesn't produce or consume any intermediate table storage, then congratulations, Pos() is already probably about as fast as you can get (I believe it was optimized by the FastCode people a few years ago).

Upvotes: 2

Arnaud Bouchez
Arnaud Bouchez

Reputation: 43033

From its implementation, SearchBuf is slower than Pos/PosEx. But it does have other options, like whole word search and case insensitive search.

For your purpose, the UnicodeString version of Pos is IMHO slower than PosEx (Pos use slowest rep scansw asm instead of two widechar-unrolled compare for PosEx - at least in Delphi 2010). Since I guess you'd like to have a start offset for your search (creating a sub-string for calling Pos is dead slow), you'd like to use PosEx.

A Bower-More algorithm could be probably faster. You have to try in your application, on real data, to guess if it's worth it.

And if you want to search for English text, using UTF-8 for your storage is worth making a try. The search will be faster, since you'll use less memory.

But I guess that the main bottleneck of your application will be in the storage / retrieval part (i.e. access to disk or un/compression), not in the search part. Use a software profiler to guess what is worth trying to enhance:

  • Make it right before you make it fast. Make it clear before you make it faster. Keep it right when you make it faster. — Kernighan and Plauger, Elements of Programming Style.
  • Premature optimization is the root of all evil. — Donald Knuth, quoting C. A. R. Hoare
  • The key to performance is elegance, not battalions of special cases. The terrible temptation to tweak should be resisted. — Jon Bentley and Doug McIlroy
  • The rules boil down to: "1. Don't optimize early. 2. Don't optimize until you know that it's needed. 3. Even then, don't optimize until you know what's needed, and where." — Herb Sutter

Upvotes: 3

PhiS
PhiS

Reputation: 4650

As has already been pointed out, the fastest way of doing this depends on a number of things, most importantly whether you need to be able to search repeatedly or not. The second question is how important is it to you to really have the "fastest" approach rather than a reasonably fast approach and the amount of time you are willing to invest in optimisations.

Repeated searches

If you need to search repeatedly, the most efficient way for string searching I know of is by the use of suffix arrays (often combined with Burrows-Wheeler transforms). This approach is used extensively in bioinformatics where one often has to deal with a huge number of string searches over really large data sets (e.g. here). A very good suffix array (and BWT) library is the libdivsufsort C library, but unfortunately I know of no Delphi port of this library. (I believe this library is capable of handling unicode strings.)

Single searches

If you don't need to search repeatedly, a brute-force string search algorithm can be efficient, for instance the assembly-optimised FastCode versions of Pos and friends. These were, however, written before Delphi was unicode-ified and I know of no similar optimised unicode implementations. If I were to write one today and wanted to optimise it for a modern processor (capable of the SSE4.2 instruction set), I would have a serious look at the PCMPESTRI assembly instruction (reference pdf here; see also e.g. here, but I have no idea whether that code is working), which can handle the 2-byte characters you'd need for unicode string searching.

Upvotes: 8

Related Questions