Reputation: 1487
I'm currently making a Web Crawler in C#, and I have a method that receives HTML strings, extracts links from them and inserts the links into the list of all the links captured.
Since it is multi-threaded, I have used locks to prevent the list of all the strings to be accessed from a few different threads at the same time.
Which is better to do with locks?
This:
void ProcessHTML(string HTML)
{
List<string> Links = GetLinks(HTML);
for (int i = 0; i < Links.Count; i++)
{
lock (WebsitesHash)
{
lock (AllLinks)
{
if (!WebsitesHash.ContainsKey(Links[i]))
{
WebsitesHash[Links[i]] = true;
AllLinks.Add(Links[i]);
}
}
}
}
}
Or this:
void ProcessHTML(string HTML)
{
List<string> Links = GetLinks(HTML);
lock (WebsitesHash)
{
lock (AllLinks)
{
for (int i = 0; i < Links.Count; i++)
{
if (!WebsitesHash.ContainsKey(Links[i]))
{
WebsitesHash[Links[i]] = true;
AllLinks.Add(Links[i]);
}
}
}
}
}
Which is generally considered better to do - lock in each iteration, or lock all the iterations?
Other code that might be relevant:
void StartCrawl(string Seed)
{
AllLinks.Capacity = 1000 * 1000 * 10;
StreamWriter Log = new StreamWriter(File.Open("Websites.txt", FileMode.Append));
string HTML = GetHTML(Seed);
ProcessHTML(HTML);
for (int i = 0; i < AllLinks.Count; i++)
{
if (!Work)
{
Log.Close();
WebsitesHash = new Dictionary<string, bool>();
break;
}
Log.WriteLine(AllLinks[i]);
websText.Text = AllLinks.Count + "";
try { HTML = GetHTML(AllLinks[i]); }
catch { continue; }
Thread Parser = new Thread(() => ProcessHTML(HTML));
Parser.Start();
}
}
Upvotes: 4
Views: 2704
Reputation: 186
Let AllLinks is global storage of links:
public List<string> AllLinks = new List<string>();
Use List.BinarySearch method somewhere in code for adding new link:
// "link" contain string of html link
lock(AllLinks)
{
int index = AllLinks.BinarySearch(link);
if( index < 0 )
{ // link is not in AllLinks
AllLinks.Add(~index, link);
}
else
{ // link exist, "index" contain its position in list
// ...
}
}
I think that WebsitesHash object is not necessary.
UPD Additional advantage of using BinarySearch is the sorted state of AllLinks.
Upvotes: 1
Reputation: 273524
In this case, it won't matter a whole lot.
The links are retreived outside the lock so the only action is adding a handful strings to a list. That is very small so the question is moot.
If the amount of work was larger, locking inside the loop would have been preferable.
And although locks are cheap, you can optimize a little by locking just once. You can use a private object lockObject = new object();
to be clearer about the protocol.
Upvotes: 2