I have a data structure which consists of pairs of values, the first of which is an integer and the second of which is an alphanumeric string (which may begin with digits): +--------+-----------------+ | Number | Name | +--------+-----------------+ | 15 | APPLES | | 16 | APPLE COMPUTER | | 17 | ORANGE | | 21 | TWENTY-1 | | 291 | 156TH ELEMENT | +--------+-----------------+ A table of these would comprise up to 100,000 rows. I'd like to provide a lookup function in which the user can look up either the number (as if it were a string), or pieces of the string. Ideally the lookup will be "live" as the user types; after each keystroke (or maybe after a brief delay ~250-500 ms) a new search will be done to find the most likely candidates. So, for example searching on 1 will return 15 APPLES , 16 APPLE COMPUTER , 17 ORANGE , and 291 156TH ELEMENT 15 will narrow the search to 15 APPLES , 291 156TH ELEMENT AP will return 15 APPLES and 16 APPLE COMPUTER (ideally, but not required) ELEM will return 291 156TH ELEMENT . I was thinking about using two Dictionary<string, string> s since ultimately the int s are being compared as string s -- one will index by the integer part and the other by the string part. But really searching by substring shouldn't use a hash function, and it seems wasteful to use twice the memory that I feel like I should need. Ultimately the question is, is there any well-performing way to text search two large lists simultaneously for substrings? Failing that, how about a SortedDictionary ? Might increase performance but still wouldn't solve the hash problem. Thought about creating a regex on the fly, but I would think that would perform terribly. I'm new to C# (having come from the Java world) so I haven't looked into LINQ yet; is that the answer? EDIT 18:21 EST : None of the strings in the "Name" field will be longer than 12-15 characters, if that affects your potential solution.

Reputation: 5763

Which C# data structure allows searching a pair of strings most efficiently for substrings?

I have a data structure which consists of pairs of values, the first of which is an integer and the second of which is an alphanumeric string (which may begin with digits):

+--------+-----------------+
| Number | Name            |
+--------+-----------------+
| 15     | APPLES          |
| 16     | APPLE COMPUTER  |
| 17     | ORANGE          |
| 21     | TWENTY-1        |
| 291    | 156TH ELEMENT   |
+--------+-----------------+

A table of these would comprise up to 100,000 rows.

I'd like to provide a lookup function in which the user can look up either the number (as if it were a string), or pieces of the string. Ideally the lookup will be "live" as the user types; after each keystroke (or maybe after a brief delay ~250-500 ms) a new search will be done to find the most likely candidates. So, for example searching on

1 will return 15 APPLES, 16 APPLE COMPUTER, 17 ORANGE, and 291 156TH ELEMENT
15 will narrow the search to 15 APPLES, 291 156TH ELEMENT
AP will return 15 APPLES and 16 APPLE COMPUTER
(ideally, but not required) ELEM will return 291 156TH ELEMENT.

I was thinking about using two Dictionary<string, string>s since ultimately the ints are being compared as strings -- one will index by the integer part and the other by the string part.

But really searching by substring shouldn't use a hash function, and it seems wasteful to use twice the memory that I feel like I should need.

Ultimately the question is, is there any well-performing way to text search two large lists simultaneously for substrings?

Failing that, how about a SortedDictionary? Might increase performance but still wouldn't solve the hash problem.

Thought about creating a regex on the fly, but I would think that would perform terribly.

I'm new to C# (having come from the Java world) so I haven't looked into LINQ yet; is that the answer?

EDIT 18:21 EST: None of the strings in the "Name" field will be longer than 12-15 characters, if that affects your potential solution.

Upvotes: 6

Answers (3)

Olivier Jacot-Descombes

Reputation: 112762

Since you are searching for the beginning of words, key based collections will not work, unless you store all possible pieces of the words, like "a", "ap", "app", "appl", "apple".

My suggestion is to use a System.Collections.Generic.List<T> in conjunction with a binary search. You would have to provide your own IComparer<T>, which also finds the beginning of words. You would use two data structures.

One List<KeyValuePair<string,int>> holding single words or the number as key and the number as value.

One Dictionary<int,string> holding the whole name.

You would proceed like this:

Split your sentence (the whole name) into single words.
Add them to the list with the word as key and the number as value of the KeyValuePair.
Add the number to the list as key and as value of the KeyValuePair.
When the list is full, sort the list in order to allow a binary search.

Search for a beginning of a word:

Search in the list by using BinarySearch in conjunction with your IComparer<T>.
The index you get from the search might not be the first that applies, so go back in the list until you find the first entry that matches.
Using the number stored as value in the list, look up the whole name in the dictionary using this number as key.

Upvotes: 1

Phil Bolduc

Reputation: 1656

If possible, I would avoid loading all 100,000 entries into memory. I would use either a database or Lucene.Net to index the values. Then use the appropriate query syntax to efficiently search for the results.

Upvotes: 6

doblak

Reputation: 3146

I'd consider using Trie data structure.

How to achieve that? Leaves would represent your "row", but you would have "two paths" to each memory instance of a "row" (one for number and the other one for name).

You can then sacrifice your condition:

(ideally, but not required) ELEM will return 291 156TH ELEMENT.

Or provide even more paths to your row instances.

Upvotes: 3

Which C# data structure allows searching a pair of strings most efficiently for substrings?

Answers (3)

Related Questions