Reputation: 717
I have a stream of sentences (tweets) and over 10 million names. I want to determine if a single sentence (tweet) contains mention of one of the 10 million names. I could compile regex for all the possible patterns but I would really like to know if there is an efficient algorithm to do that.
Thanks,
Upvotes: 3
Views: 992
Reputation: 31
You could go about it from the other way round. As the sentence comes in, split it into tokens and build a RegEx Pattern for each token, something like ^token\s*. Compare each of those against your 10 million names assuming each on line.
Upvotes: 0
Reputation: 23665
I don't think you need pattern matching at all, if you only seek for the occurence of a simple string (name). If you are actually aiming at twitter names -- are they not prefixed with an @ sign when mentioned in tweets? If so, at first just seek for the @ sign and proceed from there.
To check if the string after the @ is one of your 10 million strings, a prefix tree as proposed by ruakh is definitely a good idea .
Upvotes: 0