Reputation: 13
I am working on a web crawler. I am using the WebBrowser control for this purpose. I have the list of URLs stored in a database, and I want to traverse all those URLs one by one and parse the HTML.
I used the following logic
foreach (string href in hrefs)
{
    webBrowser1.Navigate(href);
}
I want to do some work in the webBrowser1_DocumentCompleted event once a page has loaded completely. But webBrowser1_DocumentCompleted never gets control while I am inside the loop; it only fires for the last URL in hrefs, after the loop has exited.
What's the best way to handle such a problem?
Upvotes: 1
Views: 6828
Reputation: 39916
First of all, you are setting a new URL on the same WebBrowser control before it has finished loading anything, so you will simply see the last URL in your browser. The browser certainly takes some time to load a URL, so each navigation is cancelled well before DocumentCompleted can fire.
There is only one way to do this simultaneously:
You have to use a tab control and open a new tab page for every URL. Each tab page gets its own WebBrowser control, and you can set its URL.
foreach (string href in hrefs)
{
    TabPage page = new TabPage(href);
    WebBrowser wb = new WebBrowser { Dock = DockStyle.Fill };
    wb.DocumentCompleted += wb_DocumentCompleted;
    wb.Url = new Uri(href);
    page.Controls.Add(wb);
    tabControl1.TabPages.Add(page);
}

private void wb_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    // the sender is the WebBrowser that finished loading
    WebBrowser wb = (WebBrowser)sender;
    // do your stuff...
}
To improve on the method above, you could look into creating the tab pages on different UI threads; that is too long a topic to discuss here, but it is still possible.
Another method is to use a queue:
private static Queue<string> queue = new Queue<string>();

// enqueue every URL, then kick off the first navigation
foreach (string href in hrefs)
{
    queue.Enqueue(href);
}
webBrowser1.Url = new Uri(queue.Dequeue());

private void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    // parse the finished page here, then move on to the next URL
    if (queue.Count > 0)
    {
        webBrowser1.Url = new Uri(queue.Dequeue());
    }
}
Upvotes: 1
Reputation: 1499750
Store the list somewhere in your state, as well as the index of where you've got to. Then in the DocumentCompleted event, parse the HTML and then navigate to the next page.
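A minimal sketch of that approach, assuming a WinForms form with a WebBrowser named webBrowser1 (the field and method names are illustrative):

private List<string> hrefs;   // URLs loaded from the database
private int currentIndex;     // index of the URL currently being crawled

private void StartCrawl(List<string> urls)
{
    hrefs = urls;
    currentIndex = 0;
    webBrowser1.Navigate(hrefs[currentIndex]);
}

private void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    // DocumentCompleted can fire once per frame; this check filters
    // the event down to the top-level document only
    if (e.Url != webBrowser1.Url) return;

    // parse the page that just finished loading
    string html = webBrowser1.DocumentText;
    // ... parsing work ...

    // then navigate to the next URL, if any remain
    currentIndex++;
    if (currentIndex < hrefs.Count)
    {
        webBrowser1.Navigate(hrefs[currentIndex]);
    }
}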
(Personally I wouldn't use the WebBrowser control for web crawling... I know it means it'll handle the JavaScript for you, but it'll be a lot harder to parallelize nicely than using multiple WebRequest or WebClient objects.)
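For comparison, a rough sketch of the WebClient route (no JavaScript execution, but trivially parallelizable; ParseHtml is a hypothetical placeholder for your own parsing code, and each URL gets its own WebClient because the class is not safe to share across threads):

using System.Net;
using System.Threading.Tasks;

Parallel.ForEach(hrefs, href =>
{
    using (var client = new WebClient())
    {
        string html = client.DownloadString(href);
        ParseHtml(href, html);   // hypothetical parsing method
    }
});

This runs off the UI thread, so there is no event dance at all: each download completes synchronously inside its own loop iteration.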
Upvotes: 4