Keith

Reputation: 139

Image scraper with C#

I'm trying to go through a web page's source code and add each <img src="http://www.dot.com/image.jpg"> element to an HtmlElementCollection. I then cycle through each element in the collection with a foreach loop and download the image from its URL.

Here's what I have so far. My problem right now is that nothing is downloading, and I don't think my elements are being added properly by tag name. If they are, I can't seem to reference them for the download.

public partial class Form1 : Form
{
    public Form1()
    {
        InitializeComponent();
    }

    public void button1_Click(object sender, EventArgs e)
    {
        string url = urlTextBox.Text;
        string sourceCode = WorkerClass.ScreenScrape(url);
        // Dispose the writer so the buffered text is flushed to disk.
        using (StreamWriter sw = new StreamWriter("sourceScraped.html"))
        {
            sw.Write(sourceCode);
        }
    }

    private void button2_Click(object sender, EventArgs e)
    {
        string url = urlTextBox.Text;
        WebBrowser browser = new WebBrowser();
        browser.Navigate(url);
        HtmlElementCollection collection;
        List<HtmlElement> imgListString = new List<HtmlElement>();
        if (browser != null)
        {
            if (browser.Document != null)
            {
                collection = browser.Document.GetElementsByTagName("img");
                if (collection != null)
                {
                    foreach (HtmlElement element in collection)
                    {
                        WebClient wClient = new WebClient();
                        string urlDownload = element.FirstChild.GetAttribute("src");
                        wClient.DownloadFile(urlDownload, urlDownload.Substring(urlDownload.LastIndexOf('/')));
                    }
                }
            }
        }
    }
}

}

Upvotes: 1

Views: 6738

Answers (3)

Keith

Reputation: 139

To anyone interested, here is the solution. It's exactly what Damith said. Html Agility Pack was the first thing I tried, but I found it to be rather broken. Handling DocumentCompleted ended up being a more viable solution for me, and this is my final code.

private void button2_Click(object sender, EventArgs e)
{
    string url = urlTextBox.Text;
    WebBrowser browser = new WebBrowser();
    // Subscribe before navigating so the completion event cannot be missed.
    browser.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(DownloadFiles);
    browser.Navigate(url);
}

private void DownloadFiles(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    // The browser is local to button2_Click, so recover it from the event sender.
    WebBrowser browser = (WebBrowser)sender;
    if (browser.Document == null)
    {
        return;
    }

    HtmlElementCollection collection = browser.Document.GetElementsByTagName("img");
    foreach (HtmlElement element in collection)
    {
        string urlDownload = element.GetAttribute("src");
        if (urlDownload != null && urlDownload.Length != 0)
        {
            // Dispose the client once the file has been saved.
            using (WebClient wClient = new WebClient())
            {
                // Skip the '/' itself so only the file name is appended to the folder.
                wClient.DownloadFile(urlDownload, "C:\\users\\folder\\location\\" + urlDownload.Substring(urlDownload.LastIndexOf('/') + 1));
            }
        }
    }
}

Upvotes: 0

Damith

Reputation: 63065

Once you call Navigate, you assume the document is ready to traverse and check for images, but in practice it takes some time to load. You need to wait until the document has finished loading.

Add a DocumentCompleted event handler to your browser object:

 browser.DocumentCompleted += browser_DocumentCompleted;

Implement it as:

static void browser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    // The sender is the WebBrowser that finished loading.
    WebBrowser browser = (WebBrowser)sender;
    if (browser.Document != null)
    {
        HtmlElementCollection collection = browser.Document.GetElementsByTagName("img");
        foreach (HtmlElement element in collection)
        {
            // Read src from the <img> element itself, not from its first child.
            string urlDownload = element.GetAttribute("src");
            if (string.IsNullOrEmpty(urlDownload))
            {
                continue; // nothing to download for this element
            }

            using (WebClient wClient = new WebClient())
            {
                // Save under the part of the URL after the last '/', in the working directory.
                wClient.DownloadFile(urlDownload, urlDownload.Substring(urlDownload.LastIndexOf('/') + 1));
            }
        }
    }
}
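
Taking everything after the last '/' as the file name also picks up any query string (for example image.jpg?w=200). Below is a minimal sketch of a more defensive alternative, assuming the src value is an absolute URL. The names ImageNames and FileNameFromUrl and the fallback parameter are made up for illustration and are not part of the code above.

```csharp
using System;
using System.IO;

static class ImageNames
{
    // Hypothetical helper: derives a local file name from an absolute image URL.
    public static string FileNameFromUrl(string url, string fallback)
    {
        Uri uri;
        if (!Uri.TryCreate(url, UriKind.Absolute, out uri))
        {
            return fallback; // not an absolute URL
        }

        // LocalPath excludes the query string and fragment.
        string name = Path.GetFileName(uri.LocalPath);
        return string.IsNullOrEmpty(name) ? fallback : name;
    }
}
```

In the handler this could be used as wClient.DownloadFile(urlDownload, ImageNames.FileNameFromUrl(urlDownload, "image.bin")), where "image.bin" is an arbitrary placeholder for URLs that end in a slash. The sketch does not strip characters that are illegal in Windows file names.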

Upvotes: 3

nunespascal

Reputation: 17724

Take a look at Html Agility Pack.

What you need to do is download and parse the HTML, and then process the elements you are interested in. It is a good tool for such tasks.
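
As a rough sketch of the parsing step, assuming the HtmlAgilityPack NuGet package is referenced and the HTML has already been downloaded as a string. The names ImageLinks and Extract are hypothetical, chosen only for this example.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party NuGet package, not part of .NET

static class ImageLinks
{
    // Hypothetical helper: returns the absolute URL of every <img src="..."> in the markup.
    public static List<Uri> Extract(string html, Uri pageUri)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<Uri>();
        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//img[@src]");
        if (nodes == null)
        {
            return result;
        }

        foreach (HtmlNode node in nodes)
        {
            string src = node.GetAttributeValue("src", "");
            Uri absolute;
            // Resolve relative paths such as "images/a.png" against the page URL.
            if (src.Length != 0 && Uri.TryCreate(pageUri, src, out absolute))
            {
                result.Add(absolute);
            }
        }
        return result;
    }
}
```

With the question's own WorkerClass.ScreenScrape(url) supplying the source, the call would look like ImageLinks.Extract(sourceCode, new Uri(url)), and each returned Uri can be passed to WebClient.DownloadFile. In a Windows Forms file, HtmlDocument is ambiguous between HtmlAgilityPack and System.Windows.Forms, so it may need to be fully qualified there.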

Upvotes: 0
