user1352777

Reputation: 81

Abot web crawler: store web pages or just images into a folder

I am using the Abot web crawler (C#) and would like to know how to store separate web pages, or just the images, in a folder. I checked the forum, which shows the code below, but I cannot write to the same file multiple times. Does that mean I have to create a different file name each time, or is there a simpler way of storing the web pages? Also, if I only want to store the images, which options should I use? I checked the other Abot Stack Overflow posts and found the following crawledPage content properties mentioned in comments. How do I use them to store only the images?

//crawledPage.RawContent   //raw html
//crawledPage.HtmlDocument //lazy loaded html agility pack object (HtmlAgilityPack.HtmlDocument)
//crawledPage.CSDocument   //lazy loaded cs query object (CsQuery.Cq)

void crawler_ProcessPageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
{
    CrawledPage crawledPage = e.CrawledPage;

    if (crawledPage.WebException != null || crawledPage.HttpWebResponse.StatusCode != HttpStatusCode.OK)
        Console.WriteLine("Crawl of page failed {0}", crawledPage.Uri.AbsoluteUri);
    else
        Console.WriteLine("Crawl of page succeeded {0}", crawledPage.Uri.AbsoluteUri);

    if (!string.IsNullOrEmpty(crawledPage.Content.Text))
        File.WriteAllText(SOMEFILEPATH, crawledPage.Content.Text); //or crawledPage.Content.Bytes
}

P.S. I got it to store the web page using crawledPage.HtmlDocument.Save(@"C://TESTCRAWL/FILE"+rnd.Next(1, 100).ToString()+".html",System.Text.Encoding.UTF8);. Is there a way to get just the images?
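
In case it matters, here is that saving code as a proper block, with the file name derived from the page URL instead of rnd.Next(1, 100) so names can't collide (untested sketch):

// Build the file name from the URL itself so every page gets a distinct,
// repeatable name (rnd.Next can return the same number twice).
// Path is in System.IO.
string safeName = string.Join("_", crawledPage.Uri.AbsoluteUri.Split(Path.GetInvalidFileNameChars()));
string path = Path.Combine(@"C:\TESTCRAWL", safeName + ".html");
crawledPage.HtmlDocument.Save(path, System.Text.Encoding.UTF8);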

Upvotes: 1

Views: 2183

Answers (2)

mlemanczyk

Reputation: 121

You can make Abot (C#) download the images for you. There are at least two ways to do it.

Preparation

In each solution, create your own CrawlConfiguration instance and pass it to the SiteCrawler constructor.

Include your image MIME types in the configuration object, e.g.

config.DownloadableContentTypes = "text/html,application/json,text/plain,image/jpeg,image/pjpeg,*/*";
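
A minimal sketch of this preparation step, assuming Abot 1.x's stock PoliteWebCrawler (pass the same config to your SiteCrawler; the nulls tell Abot to use its default components):

using Abot.Crawler;
using Abot.Poco;

CrawlConfiguration config = new CrawlConfiguration();
config.DownloadableContentTypes = "text/html,application/json,text/plain,image/jpeg,image/pjpeg,*/*";

// null components fall back to Abot's defaults
PoliteWebCrawler crawler = new PoliteWebCrawler(config, null, null, null, null, null, null, null, null);
crawler.PageCrawlCompleted += crawler_ProcessPageCrawlCompleted;
crawler.Crawl(new Uri("http://example.com/"));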

Solution 1

  1. Create your own LinkSelector inheriting from HapHyperLinkParser and pass it to the SiteCrawler constructor.
  2. In the LinkSelector, override GetHrefValues. Extract the image URLs from the downloaded page and include them in the returned list.
  3. Save the images in your crawler_ProcessPageCrawlCompleted handler by referring to crawledPage.Content.Bytes (see the sketch after this list).
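
A minimal sketch of such a LinkSelector, assuming HapHyperLinkParser exposes the protected virtual GetHrefValues(CrawledPage) named above; pass an instance as the hyperLinkParser argument of the crawler constructor:

using System.Collections.Generic;
using System.Linq;
using Abot.Core;
using Abot.Poco;

public class LinkSelector : HapHyperLinkParser
{
    protected override IEnumerable<string> GetHrefValues(CrawledPage crawledPage)
    {
        // Keep the regular anchor hrefs...
        List<string> hrefs = base.GetHrefValues(crawledPage).ToList();

        // ...and add every <img src="..."> so the images get scheduled too.
        var imgNodes = crawledPage.HtmlDocument.DocumentNode.SelectNodes("//img[@src]");
        if (imgNodes != null)
            hrefs.AddRange(imgNodes
                .Select(n => n.GetAttributeValue("src", null))
                .Where(src => !string.IsNullOrEmpty(src)));

        return hrefs;
    }
}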

Solution 2

  1. Extract the image URLs in your crawler_ProcessPageCrawlCompleted handler and add them to the crawler's scheduler like this:

    e.CrawlContext.Scheduler.Add(new PageToCrawl(new Uri(pictureUrl)));

    Your images will then be downloaded the same way as any other HTML page.

  2. Save the images in your crawler_ProcessPageCrawlCompleted handler by referring to crawledPage.Content.Bytes.

In either case you can tell whether the response is a page or an image by, e.g., its URL.
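
A minimal sketch combining both steps of Solution 2 with such a URL check; IsImageUrl, the extension list, and the save folder are illustrative, not Abot APIs:

using System;
using System.IO;
using Abot.Crawler;
using Abot.Poco;

void crawler_ProcessPageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
{
    CrawledPage crawledPage = e.CrawledPage;

    if (IsImageUrl(crawledPage.Uri))
    {
        // The scheduler fetched this URL like any other page,
        // so the raw response bytes are the image itself.
        string fileName = Path.GetFileName(crawledPage.Uri.LocalPath);
        File.WriteAllBytes(Path.Combine(@"C:\TESTCRAWL", fileName), crawledPage.Content.Bytes);
        return;
    }

    // Otherwise it's an HTML page: queue every <img> src for crawling.
    var imgNodes = crawledPage.HtmlDocument.DocumentNode.SelectNodes("//img[@src]");
    if (imgNodes == null)
        return;

    foreach (var img in imgNodes)
    {
        string src = img.GetAttributeValue("src", null);
        if (!string.IsNullOrEmpty(src))
            e.CrawlContext.Scheduler.Add(new PageToCrawl(new Uri(crawledPage.Uri, src))); // resolves relative src
    }
}

static bool IsImageUrl(Uri uri)
{
    string ext = Path.GetExtension(uri.LocalPath).ToLowerInvariant();
    return ext == ".jpg" || ext == ".jpeg" || ext == ".png" || ext == ".gif";
}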

Benefits

There are significant benefits to using your crawler instead of a separate downloader.

If the website requires a login before you can download anything, you can establish the session once for the crawler and not worry about opening another one. Some websites also prevent multiple logins by the same user.

Also, with a separate downloader you need to make sure it doesn't establish a new connection for each image. I'd recommend creating a connection pool and reusing it; otherwise you can bring the server down.

My preference is still to use just the crawler.

Upvotes: 3

Vikash Rathee

Reputation: 2064

Abot doesn't download images automatically; it's built to crawl web URLs. You need to write your own code to extract the image URLs and then loop through them.

Step 1: Extract the image src attributes from the page source using HtmlAgilityPack

List<string> imgSrcs = new List<string>();
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(crawledPage.Content.Text);
var nodes = doc.DocumentNode.SelectNodes(@"//img[@src]");
if (nodes != null)
{
    foreach (var img in nodes)
    {
        HtmlAttribute att = img.Attributes["src"];
        imgSrcs.Add(att.Value);
    }
}

Step 2: Loop through each src in the list and download the image to the C: drive

// WebClient lives in System.Net
int i = 0;
using (WebClient client = new WebClient())
{
    foreach (string src in imgSrcs)
    {
        // assumes src is an absolute URL
        client.DownloadFile(new Uri(src), @"c:\temp\image_" + i + ".jpg");
        i++;
    }
}

Note: I'm using the i variable to give each image a unique name; otherwise the same file would be overwritten each time.

Upvotes: 3
