user1352777

Reputation: 81

Abot web crawler: store web pages or just images into a folder

I am using the Abot web crawler (C#) and would like to know how to store separate web pages, or just the images, in a folder. I checked the forum, which shows the code below, but I cannot write to the same file multiple times. Does that mean I have to create a different file name each time, or is there a simpler way of storing the web pages? Also, if I only want to store the images, which options should I use? I checked the other Abot Stack Overflow posts and found the following crawledPage content properties mentioned in comments. How do I use them to store only the images?

//crawledPage.RawContent   //raw html
//crawledPage.HtmlDocument //lazy loaded html agility pack object (HtmlAgilityPack.HtmlDocument)
//crawledPage.CSDocument   //lazy loaded cs query object (CsQuery.Cq)

void crawler_ProcessPageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
{
    CrawledPage crawledPage = e.CrawledPage;

    if (crawledPage.WebException != null || crawledPage.HttpWebResponse.StatusCode != HttpStatusCode.OK)
        Console.WriteLine("Crawl of page failed {0}", crawledPage.Uri.AbsoluteUri);
    else
        Console.WriteLine("Crawl of page succeeded {0}", crawledPage.Uri.AbsoluteUri);

    if (!string.IsNullOrEmpty(crawledPage.Content.Text))
        File.WriteAllText(SOMEFILEPATH, crawledPage.Content.Text); //or crawledPage.Content.Bytes
}

P.S. I got it to store the web page using crawledPage.HtmlDocument.Save(@"C://TESTCRAWL/FILE"+rnd.Next(1, 100).ToString()+".html",System.Text.Encoding.UTF8);. Is there a way to get just the images?
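
In case it matters, here is that saving code as a proper block, with the file name derived from the page URL instead of rnd.Next(1, 100) so names can't collide (untested sketch):

// Build the file name from the URL itself so every page gets a distinct,
// repeatable name (rnd.Next can return the same number twice).
// Path is in System.IO.
string safeName = string.Join("_", crawledPage.Uri.AbsoluteUri.Split(Path.GetInvalidFileNameChars()));
string path = Path.Combine(@"C:\TESTCRAWL", safeName + ".html");
crawledPage.HtmlDocument.Save(path, System.Text.Encoding.UTF8);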

Upvotes: 1

Views: 2183

Answers (2)

mlemanczyk

Reputation: 121

You can make Abot (C#) download the images for you. There are at least two ways to do it.

Preparation

In each solution, create your own CrawlConfiguration instance and pass it to the SiteCrawler constructor.

Include your image MIME types in the configuration object, e.g.

config.DownloadableContentTypes = "text/html,application/json,text/plain,image/jpeg,image/pjpeg,*/*";
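
A minimal sketch of this preparation step, assuming Abot 1.x's stock PoliteWebCrawler (pass the same config to your SiteCrawler; the nulls tell Abot to use its default components):

using Abot.Crawler;
using Abot.Poco;

CrawlConfiguration config = new CrawlConfiguration();
config.DownloadableContentTypes = "text/html,application/json,text/plain,image/jpeg,image/pjpeg,*/*";

// null components fall back to Abot's defaults
PoliteWebCrawler crawler = new PoliteWebCrawler(config, null, null, null, null, null, null, null, null);
crawler.PageCrawlCompleted += crawler_ProcessPageCrawlCompleted;
crawler.Crawl(new Uri("http://example.com/"));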

Solution 1

  1. Create your own LinkSelector inheriting from HapHyperLinkParser and pass it to the SiteCrawler constructor.
  2. In the LinkSelector, override GetHrefValues. Extract the image URLs from the downloaded page and include them in the returned list.
  3. Save the images in your crawler_ProcessPageCrawlCompleted handler by referring to crawledPage.Content.Bytes (see the sketch after this list).
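
A minimal sketch of such a LinkSelector, assuming HapHyperLinkParser exposes the protected virtual GetHrefValues(CrawledPage) named above; pass an instance as the hyperLinkParser argument of the crawler constructor:

using System.Collections.Generic;
using System.Linq;
using Abot.Core;
using Abot.Poco;

public class LinkSelector : HapHyperLinkParser
{
    protected override IEnumerable<string> GetHrefValues(CrawledPage crawledPage)
    {
        // Keep the regular anchor hrefs...
        List<string> hrefs = base.GetHrefValues(crawledPage).ToList();

        // ...and add every <img src="..."> so the images get scheduled too.
        var imgNodes = crawledPage.HtmlDocument.DocumentNode.SelectNodes("//img[@src]");
        if (imgNodes != null)
            hrefs.AddRange(imgNodes
                .Select(n => n.GetAttributeValue("src", null))
                .Where(src => !string.IsNullOrEmpty(src)));

        return hrefs;
    }
}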

Solution 2

  1. Extract the image URLs in your crawler_ProcessPageCrawlCompleted handler and add them to the crawler's scheduler like this:

    e.CrawlContext.Scheduler.Add(new PageToCrawl(new Uri(pictureUrl)));

    Your images will then be downloaded the same way as any other HTML page.

  2. Save the images in your crawler_ProcessPageCrawlCompleted handler by referring to crawledPage.Content.Bytes.

In either case you can tell whether the response is a page or an image by, e.g., its URL.
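
A minimal sketch combining both steps of Solution 2 with such a URL check; IsImageUrl, the extension list, and the save folder are illustrative, not Abot APIs:

using System;
using System.IO;
using Abot.Crawler;
using Abot.Poco;

void crawler_ProcessPageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
{
    CrawledPage crawledPage = e.CrawledPage;

    if (IsImageUrl(crawledPage.Uri))
    {
        // The scheduler fetched this URL like any other page,
        // so the raw response bytes are the image itself.
        string fileName = Path.GetFileName(crawledPage.Uri.LocalPath);
        File.WriteAllBytes(Path.Combine(@"C:\TESTCRAWL", fileName), crawledPage.Content.Bytes);
        return;
    }

    // Otherwise it's an HTML page: queue every <img> src for crawling.
    var imgNodes = crawledPage.HtmlDocument.DocumentNode.SelectNodes("//img[@src]");
    if (imgNodes == null)
        return;

    foreach (var img in imgNodes)
    {
        string src = img.GetAttributeValue("src", null);
        if (!string.IsNullOrEmpty(src))
            e.CrawlContext.Scheduler.Add(new PageToCrawl(new Uri(crawledPage.Uri, src))); // resolves relative src
    }
}

static bool IsImageUrl(Uri uri)
{
    string ext = Path.GetExtension(uri.LocalPath).ToLowerInvariant();
    return ext == ".jpg" || ext == ".jpeg" || ext == ".png" || ext == ".gif";
}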

Benefits

There are significant benefits to using your crawler instead of a separate downloader.

If the website requires a login before you can download anything, you can establish the session once for the crawler and not worry about opening another one. Some websites also prevent multiple logins by the same user.

Also, with a separate downloader you need to make sure it doesn't establish a new connection for each image. I'd recommend creating a connection pool and reusing it; otherwise you can bring the server down.

My preference is still to use just the crawler.

Upvotes: 3

Vikash Rathee

Reputation: 2064

Abot doesn't download images automatically; it's built to crawl web URLs. You need to write your own code to extract the image URLs and then loop through them.

Step 1: Extract the image src attributes from the page source using HtmlAgilityPack

List<string> imgSrcs = new List<string>();
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(crawledPage.Content.Text);
var nodes = doc.DocumentNode.SelectNodes(@"//img[@src]");
if (nodes != null)
{
    foreach (var img in nodes)
    {
        HtmlAttribute att = img.Attributes["src"];
        imgSrcs.Add(att.Value);
    }
}

Step 2: Loop through each src in the list and download the image to the C: drive

// WebClient lives in System.Net
int i = 0;
using (WebClient client = new WebClient())
{
    foreach (string src in imgSrcs)
    {
        // assumes src is an absolute URL
        client.DownloadFile(new Uri(src), @"c:\temp\image_" + i + ".jpg");
        i++;
    }
}

Note: I'm using the i variable to give each image a unique name; otherwise the same file would be overwritten each time.

Upvotes: 3
