DotNet
DotNet

Reputation: 717

Efficient string matching in java

I have a stream of sentences (tweets) and over 10 million names. I want to determine if a single sentence (tweet) contains mention of one of the 10 million names. I could compile regex for all the possible patterns but I would really like to know if there is an efficient algorithm to do that.

Thanks,

Upvotes: 3

Views: 992

Answers (4)

milam
milam

Reputation: 31

You could go about it from the other way round. As the sentence comes in, split it into tokens and build a RegEx Pattern for each token, something like ^token\s*. Compare each of those against your 10 million names assuming each on line.

Upvotes: 0

Ridcully
Ridcully

Reputation: 23665

I don't think you need pattern matching at all, if you only seek for the occurence of a simple string (name). If you are actually aiming at twitter names -- are they not prefixed with an @ sign when mentioned in tweets? If so, at first just seek for the @ sign and proceed from there.

To check if the string after the @ is one of your 10 million strings, a prefix tree as proposed by ruakh is definitely a good idea .

Upvotes: 0

mindas
mindas

Reputation: 26733

You might try using Bloom filter. Demo here.

Upvotes: 3

ruakh
ruakh

Reputation: 183612

You could build a trie (a prefix tree).

Upvotes: 3

Related Questions