Sparrow_ua

Reputation: 704

Efficient method for checking substrings C#

I have a bunch of txt files, each containing 300k lines. Each line has a URL, e.g. http://www.ieee.org/conferences_events/conferences/conferencedetails/index.html?Conf_ID=30718

In a string[] array I have a list of web-sites:

amazon.com
google.com
ieee.org
...

I need to check whether each URL contains one of these web-sites and update the counter that corresponds to that web-site.

For now I'm using the Contains method, but it is very slow. There are ~900 records in the array, so the worst case is 900 * 300K comparisons (for one file). I believe IndexOf would be just as slow.

Can someone suggest a faster approach? Thank you in advance.

Upvotes: 4

Views: 1133

Answers (6)

McAden

Reputation: 13972

You'd have to test the performance, but you might try converting the URLs to actual System.Uri objects.

Store the list of websites as a HashSet<string>, then use the HashSet to look up each Uri's Host:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

IEnumerable<Uri> inputUrls = File.ReadAllLines(@"c:\myFile.txt").Select(e => new Uri(e));
string[] myUrls = new[] { "amazon.com", "google.com", "stackoverflow.com" };
HashSet<string> urls = new HashSet<string>(myUrls);
IEnumerable<Uri> matches = inputUrls.Where(e => urls.Contains(e.Host));

Upvotes: 0

stovroz

Reputation: 7065

Your problem, as you describe it, should not involve searching for substrings at all. Split your source file into lines (or read it line by line); each line already contains a URL. Run each line through some function that extracts the domain name, then look the domain up in a fast-access tally of your target domains, such as a Dictionary<string, int>, incrementing as you go, e.g.:

// generate 300k fake URLs and ~900 fake target domains as test data
var source = Enumerable.Range(0, 300000).Select(x => Guid.NewGuid().ToString()).Select(x => x.Substring(0, 4) + ".com/" + x.Substring(4, 10));
var targets = Enumerable.Range(0, 900).Select(x => Guid.NewGuid().ToString().Substring(0, 4) + ".com").Distinct();
var tally = targets.ToDictionary(x => x, x => 0);
// take everything before the first '/' as the domain
Func<string, string> naiveDomainExtractor = x => x.Split('/')[0];
foreach (var line in source)
{
    var domain = naiveDomainExtractor(line);
    if (tally.ContainsKey(domain)) tally[domain]++;
}

...which takes a third of a second on my not particularly speedy machine, including generation of test data.

Admittedly your domain extractor may be a bit more sophisticated, but it will probably not be very processor intensive, and if you've got multiple cores at your disposal you can speed things up further by using a ConcurrentDictionary<string, int> and Parallel.ForEach, as sketched below.
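
A sketch of that parallel variant (it reuses source, targets and naiveDomainExtractor from the snippet above; the dictionary is renamed to avoid clashing with the earlier tally):

using System.Collections.Concurrent;
using System.Threading.Tasks;

// seed a thread-safe counter for every target domain
var concurrentTally = new ConcurrentDictionary<string, int>(
    targets.Select(t => new KeyValuePair<string, int>(t, 0)));

Parallel.ForEach(source, line =>
{
    var domain = naiveDomainExtractor(line);
    if (concurrentTally.ContainsKey(domain))
        concurrentTally.AddOrUpdate(domain, 1, (key, count) => count + 1);
});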

Upvotes: 0

Tony Hopkinson

Reputation: 20320

Well, in a sort of similar situation, though with IndexOf, I achieved a huge performance improvement with a simple loop,

as in something like

int l = url.Length;
int position = 0;
while (position < l)
{
   // cheap first-character check before doing a full comparison
   if (url[position] == website[0])
   {
      // test the rest of the web site from this position in another loop
      if (exactMatch(url, position, website))
         counter++; // hypothetical counter for this web site
   }
   position++;
}

It may look a bit crude, but in an extreme case, searching for a set of about 10 strings in a large structured (1.2 MB) file (so regex was out), I went from 3 minutes to under 1 second.

Upvotes: 0

Nikola Radosavljević

Reputation: 6911

A good solution would leverage hashing. My approach would be the following (a code sketch follows the list):

  1. Hash all your known hosts (the string[] collection that you mention)
  2. Store the hashes in a List<int> (hashes.Add("www.ieee.com".GetHashCode()))
  3. Sort the list (hashes.Sort())
  4. When looking up a URL:
    1. Parse out the host name from the URL (get ieee.com from http://www.ieee.com/...). You can use new Uri("http://www.ieee.com/...").Host to get www.ieee.com.
    2. Preprocess it to always expect the same case. Use lower case (if you have http://www.IEee.COM/, take www.ieee.com)
    3. Hash the parsed host name and look for it in the hashes list, using the BinarySearch method to find the hash
    4. If the hash exists, then you have this host in your list
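
A minimal sketch of that lookup (the hard-coded hosts and URL are just examples; note that string.GetHashCode is only stable within a single process, which is all this needs):

using System;
using System.Collections.Generic;

string[] knownHosts = { "www.amazon.com", "www.google.com", "www.ieee.com" };

// Steps 1-3: hash every known host (lower-cased) and sort the hashes.
var hashes = new List<int>();
foreach (var host in knownHosts)
    hashes.Add(host.ToLowerInvariant().GetHashCode());
hashes.Sort();

// Step 4: parse out the host, normalize the case, hash, and binary-search.
string candidate = new Uri("http://www.IEee.COM/conferences_events/").Host.ToLowerInvariant();
bool known = hashes.BinarySearch(candidate.GetHashCode()) >= 0;
Console.WriteLine(known); // True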

An even faster and more memory-efficient way is to use Bloom filters. I suggest you read about them on Wikipedia, and there's even a C# implementation of a Bloom filter on CodePlex. Of course, you need to take into account that a Bloom filter allows false positives (it can tell you that a value is in the collection even though it's not), so it should be used for optimization only. It will never tell you that something is absent from the collection when it is actually present.
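
For illustration, a toy Bloom filter might look like this (a hypothetical sketch, not the CodePlex implementation; two cheap hash functions are used for brevity, real filters use more and better ones):

using System;
using System.Collections;

class BloomFilter
{
    private readonly BitArray bits;
    private readonly int size;

    public BloomFilter(int size)
    {
        this.size = size;
        this.bits = new BitArray(size);
    }

    // Two cheap hash functions, masked to non-negative values.
    private int Hash1(string s) => (s.GetHashCode() & 0x7FFFFFFF) % size;
    private int Hash2(string s) => (StringComparer.OrdinalIgnoreCase.GetHashCode(s) & 0x7FFFFFFF) % size;

    public void Add(string s)
    {
        bits[Hash1(s)] = true;
        bits[Hash2(s)] = true;
    }

    // True means "possibly present" (false positives happen);
    // false means "definitely absent" (no false negatives).
    public bool MightContain(string s) => bits[Hash1(s)] && bits[Hash2(s)];
}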


Using a Dictionary<TKey, TValue> is also an option, but if you only need to count the number of occurrences, it's more efficient to maintain the collection of hashes yourself.

Upvotes: 3

Bernhard Barker

Reputation: 55609

Create a Dictionary of domain to counter.

For each URL, extract the domain (I'll leave that part to you to figure out), then look up the domain in the Dictionary and increment its counter, as in the sketch below.
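
A minimal sketch of that (sites stands for your ~900-entry string[] and domain for the extracted domain, both assumed here):

var counters = new Dictionary<string, int>();
foreach (var site in sites)   // sites: the known web-site list (assumed)
    counters[site] = 0;

// per URL, after extracting its domain:
if (counters.TryGetValue(domain, out int count))
    counters[domain] = count + 1;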


I assume we're talking about domains, since that's what you showed in your array as examples. If the match can be any part of the URL instead, storing all your strings in a trie-like structure could work; a sketch follows.
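
A minimal sketch of such a trie (a hypothetical implementation; the Aho-Corasick algorithm builds on exactly this structure and avoids re-scanning from every position):

using System.Collections.Generic;

class TrieNode
{
    public Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public bool IsTerminal; // true if a stored string ends here
}

class SubstringTrie
{
    private readonly TrieNode root = new TrieNode();

    public void Add(string word)
    {
        var node = root;
        foreach (var c in word)
        {
            if (!node.Children.TryGetValue(c, out var next))
                node.Children[c] = next = new TrieNode();
            node = next;
        }
        node.IsTerminal = true;
    }

    // Walk the trie from every starting position in the text.
    public bool ContainsAny(string text)
    {
        for (int start = 0; start < text.Length; start++)
        {
            var node = root;
            for (int i = start; i < text.Length; i++)
            {
                if (!node.Children.TryGetValue(text[i], out node))
                    break;
                if (node.IsTerminal)
                    return true;
            }
        }
        return false;
    }
}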

Upvotes: 1

Only a Curious Mind

Reputation: 2857

You can read this question; its answers will help you:

High performance "contains" search in list of strings in C#

Upvotes: 0
