Reputation: 19602
I have lots of strings containing text in lots of different spellings. I am tokenizing these strings by searching for keywords and if a keyword is found I use an assoicated text for that keyword.
Let's say the search string can contain the text "schw.", "schwa." and "schwarz". I have three keywords that all resolve to the text "schwarz".
Now I'm searching for an effective way to find all the keywords without doing a string.Contains(keyword) for every single keyword.
Sample data:
H-Fuss ahorn 15 cm/SH48cm
Metall-Fuss chrom 9 cm/SH42cm
Metall-Kufe alufbg.12 cm/SH45c
Metall-Kufe verchr.12 cm/SH45c
Metall-Zylind.aluf.12cm/SH45cm
Kufe alufarbig
Metall-Zylinder hoch alufarbig
Kunststoffgl.schw. - hoch
Kunststoffgl.schw. - Standard
Kunststoffgleiter - schwarz für Sitzhoehe 42 cm
Sample keywords (key, value):
h-fuss, Holz
ahorn, Ahorn
metall, Metall
chrom, Chrom
verchr, Chrom
alum, Aluminium
aluf, Aluminium
kufe, Kufe
zylind, Zylinder
hoch, Hoch
kunststoffgl, Gleiter
gleiter, Gleiter
schwarz, Schwarz
schw., Schwarz
Sample result:
Holz, Ahorn
Metall, Chrom
Metall, Kufe, Aluminium
Metall, Kufe, Chrom
Metall, Zylinder, Aluminium
Kufe, Aluminium
Metall, Zylinder, Hoch, Aluminium
Gleiter, Schwarz, Hoch
Gleiter, Schwarz
Gleiter, Schwarz
Upvotes: 9
Views: 10841
Reputation: 19938
I would use precompiled regular expressions for each group of keywords to match. In the background these are "compiled" to finite automata, so they are pretty fast in recognizing the pattern in your string and much faster than a Contains
for each of the possible strings.
using: System.Text.RegularExpressions
.
In your example:
new Regex(@"schw(a?\.|arz)", RegexOptions.Compiled)
Further documentation available here: http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regexoptions(v=VS.90).aspx
Upvotes: 3
Reputation: 8231
If you have a fixed set of keywords you can use (f)lex, re2c or ragel
Upvotes: 1
Reputation: 45101
Maybe it's a little overpowered but you should definitely take a look at ANTLR.
Upvotes: 0
Reputation: 81660
I suggest to approaches:
1) Tokenise using string.Split
and match against a Dictionary of keys you have
2) Implement tokeniser yourself a reader with ReadToken()
method which it adds the characters to a buffer until it finds (Split could be doing that) a split character and outputs that as token. Then you check against your dictionary.
Upvotes: 0
Reputation: 41769
This seems to fit "Algorithms using finite set of patterns"
The Aho–Corasick string matching algorithm is a string searching algorithm invented by Alfred V. Aho and Margaret J. Corasick. It is a kind of dictionary-matching algorithm that locates elements of a finite set of strings (the "dictionary") within an input text. It matches all patterns "at once", so the complexity of the algorithm is linear in the length of the patterns plus the length of the searched text plus the number of output matches. Note that because all matches are found, there can be a quadratic number of matches if every substring matches (e.g. dictionary = a, aa, aaa, aaaa and input string is aaaa).
The Rabin–Karp algorithm is a string searching algorithm created by Michael O. Rabin and Richard M. Karp in 1987 that uses hashing to find any one of a set of pattern strings in a text. For text of length n and p patterns of combined length m, its average and best case running time is O(n+m) in space O(p), but its worst-case time is O(nm). In contrast, the Aho–Corasick string matching algorithm has asymptotic worst-time complexity O(n+m) in space O(m).
Upvotes: 17