Reputation: 704
I have a bunch of txt files that contain 300k lines each. Each line has a URL, e.g. http://www.ieee.org/conferences_events/conferences/conferencedetails/index.html?Conf_ID=30718
In some string[] array I have a list of websites:
amazon.com
google.com
ieee.org
...
I need to check whether each URL contains one of these websites and update a counter that corresponds to that website.
For now I'm using the Contains method, but it is very slow. There are ~900 records in the array, so the worst case is 900 * 300K comparisons (for one file). I believe IndexOf would be slow as well.
Can someone help me with a faster approach? Thank you in advance.
Upvotes: 4
Views: 1133
Reputation: 13972
You'd have to test the performance, but you might try converting the URLs to actual System.Uri objects.
Store the list of websites as a HashSet<string>, then use the HashSet to look up each Uri's Host:
// Parse each input line into a Uri so we can read its Host.
IEnumerable<Uri> inputUrls = File.ReadAllLines(@"c:\myFile.txt").Select(e => new Uri(e));
string[] myUrls = new[] { "amazon.com", "google.com", "stackoverflow.com" };
HashSet<string> urls = new HashSet<string>(myUrls);
// O(1) lookup per URL instead of scanning the whole array.
IEnumerable<Uri> matches = inputUrls.Where(e => urls.Contains(e.Host));
Upvotes: 0
Reputation: 7065
Your problem, as you describe it, should not involve searching for substrings at all. Split your source file into lines (or read it line by line), since you already know each line contains a URL. Run each line through some function to extract the domain name, then compare the result against a fast-access tally of your target domains, such as a Dictionary<string, int>, incrementing as you go, e.g.:
var source = Enumerable.Range(0, 300000).Select(x => Guid.NewGuid().ToString()).Select(x => x.Substring(0, 4) + ".com/" + x.Substring(4, 10));
var targets = Enumerable.Range(0, 900).Select(x => Guid.NewGuid().ToString().Substring(0, 4) + ".com").Distinct();
var tally = targets.ToDictionary(x => x, x => 0);
Func<string, string> naiveDomainExtractor = x => x.Split('/')[0];

foreach (var line in source)
{
    var domain = naiveDomainExtractor(line);
    if (tally.ContainsKey(domain)) tally[domain]++;
}
...which takes a third of a second on my not particularly speedy machine, including generation of the test data.
Admittedly your domain extractor may be a bit more sophisticated, but it will probably not be very processor-intensive, and if you've got multiple cores at your disposal you can speed things up further by using a ConcurrentDictionary<string, int> and Parallel.ForEach.
Upvotes: 0
Reputation: 20320
Well, for a somewhat similar need (though with IndexOf), I achieved a huge performance improvement with a simple loop, something like:
int l = url.Length;
int position = 0;
while (position < l)
{
    if (url[position] == website[0])
    {
        // test the rest of the website from this position in another loop
        if (ExactMatch(url, position, website))
        {
            // match found - update the counter here
        }
    }
    position++;
}
It seems a bit crude, but in an extreme case (searching for a set of about 10 strings in a large, structured 1.2 MB file, so regex was out) I went from 3 minutes to under 1 second.
Upvotes: 0
Reputation: 6911
A good solution would leverage hashing. My approach would be the following:
1. Hash all your known websites (the string[] collection that you mention) and add the hashes to a List<int> (hashes.Add("www.ieee.com".GetHashCode())).
2. Sort the list (hashes.Sort()).
3. When checking a URL, extract the website part (ieee.com from http://www.ieee.com/...). You can use new Uri("http://www.ieee.com/...").Host to get www.ieee.com.
4. Normalize it to lower case (from http://www.IEee.COM/ take www.ieee.com).
5. Compute its hash and look it up in the hashes list. Use the BinarySearch method to find the hash.
An even faster and more memory-efficient way is to use Bloom filters. I suggest you read about them on Wikipedia, and there's even a C# implementation of a Bloom filter on CodePlex. Of course, you need to take into account that a Bloom filter allows false positives (it can tell you that a value is in a collection even though it's not), so it's used for optimization only. It will never tell you that something is not in the collection when it actually is.
Using a Dictionary<TKey, TValue> is also an option, but if you only need to count the number of occurrences, it's more efficient to maintain the collection of hashes yourself.
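The steps above can be sketched as follows (a minimal illustration under the assumption that the website list is stored with the `www.` prefix; note that `string.GetHashCode()` is only stable within a single process run on modern .NET, so the hash list must be rebuilt each run):

```csharp
using System;
using System.Collections.Generic;

class HashLookupSketch
{
    static void Main()
    {
        // Steps 1-2: hash the known websites and sort the hashes.
        string[] websites = { "www.amazon.com", "www.google.com", "www.ieee.com" };
        var hashes = new List<int>();
        foreach (var site in websites)
            hashes.Add(site.GetHashCode());
        hashes.Sort();

        // Steps 3-4: extract and normalize the host of an incoming URL.
        string url = "http://www.IEee.COM/conferences_events/";
        string host = new Uri(url).Host.ToLowerInvariant(); // "www.ieee.com"

        // Step 5: binary-search the sorted hash list.
        bool found = hashes.BinarySearch(host.GetHashCode()) >= 0;
        Console.WriteLine(found);
    }
}
```

Because different strings can collide on the same hash, a positive BinarySearch result should ideally be confirmed with a string comparison if exactness matters.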
Upvotes: 3
Reputation: 55609
Create a Dictionary
of domain to counter.
For each URL, extract the domain (I'll leave that part to you to figure out), then look up the domain in the Dictionary
and increment the counter.
I assume we're talking about domains since this is what you showed in your array as examples. If this can be any part of the URL instead, storing all your strings in a trie-like structure could work.
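One possible way to fill in the extraction step (a sketch only, not necessarily what the answerer had in mind: the `www.`-stripping heuristic is an assumption, and subdomains such as mail.google.com would need extra handling):

```csharp
using System;
using System.Collections.Generic;

class DomainTally
{
    static void Main()
    {
        // One counter per target domain; comparer ignores case.
        var tally = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase)
        {
            ["amazon.com"] = 0, ["google.com"] = 0, ["ieee.org"] = 0
        };

        // Stand-in for the lines read from the txt files.
        string[] lines =
        {
            "http://www.ieee.org/conferences_events/conferences/",
            "https://google.com/search?q=x"
        };

        foreach (var line in lines)
        {
            // Uri.Host yields e.g. "www.ieee.org"; strip a leading "www."
            // to match the bare domains used as dictionary keys.
            string host = new Uri(line).Host;
            if (host.StartsWith("www.", StringComparison.OrdinalIgnoreCase))
                host = host.Substring(4);

            if (tally.ContainsKey(host))
                tally[host]++; // one Dictionary lookup per line
        }

        Console.WriteLine(tally["ieee.org"]);
    }
}
```

Each line then costs a single hash lookup rather than a scan of all ~900 target sites.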
Upvotes: 1
Reputation: 2857
You can read this question; its answers will help you:
High performance "contains" search in list of strings in C#
Upvotes: 0