Reputation: 81
I am using the Abot web crawler (C#) and would like to know how to store separate web pages, or just the images, into a folder. I checked the forum, where it shows the code below, but I can't keep writing everything to the same file. Does that mean I have to create a different file name each time, or is there a simpler way of storing the web pages? Also, if I only want to store the images, what options should I use? I checked the other Abot Stack Overflow posts and found the following CrawledPage content listed as comments. How do I use them to store only the images?
//crawledPage.RawContent //raw html
//crawledPage.HtmlDocument //lazy loaded html agility pack object (HtmlAgilityPack.HtmlDocument)
//crawledPage.CSDocument //lazy loaded cs query object (CsQuery.Cq)
void crawler_ProcessPageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
{
    CrawledPage crawledPage = e.CrawledPage;

    if (crawledPage.WebException != null || crawledPage.HttpWebResponse.StatusCode != HttpStatusCode.OK)
        Console.WriteLine("Crawl of page failed {0}", crawledPage.Uri.AbsoluteUri);
    else
        Console.WriteLine("Crawl of page succeeded {0}", crawledPage.Uri.AbsoluteUri);

    if (!string.IsNullOrEmpty(crawledPage.Content.Text))
        File.WriteAllText(SOMEFILEPATH, crawledPage.Content.Text); //or File.WriteAllBytes(SOMEFILEPATH, crawledPage.Content.Bytes)
}
P.S. I got it to store the web page using crawledPage.HtmlDocument.Save(@"C://TESTCRAWL/FILE" + rnd.Next(1, 100).ToString() + ".html", System.Text.Encoding.UTF8); Is there a way to get just the images?
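For what it's worth, a hedged sketch of a naming scheme that avoids the rnd.Next collisions: derive the file name from the page URI so every page gets its own file. The output folder and the character replacements below are assumptions for illustration, not anything Abot requires.
// Illustrative only: build a filesystem-safe name from the page URI so
// successive pages never collide the way rnd.Next(1, 100) can.
string safeName = crawledPage.Uri.AbsoluteUri
    .Replace("://", "_")
    .Replace(":", "_")
    .Replace("/", "_")
    .Replace("?", "_");
crawledPage.HtmlDocument.Save(@"C:\TESTCRAWL\" + safeName + ".html", System.Text.Encoding.UTF8);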
Upvotes: 1
Views: 2183
Reputation: 121
Right now you can make Abot (C#) download images for you. There are at least two solutions for that.
Preparation
In each solution, create and use your own custom CrawlConfiguration instance and pass it to the SiteCrawler constructor.
Include your image MIME types in your configuration object, e.g.
config.DownloadableContentTypes = "text/html,application/json,text/plain,image/jpeg,image/pjpeg,*/*"
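A minimal preparation sketch, assuming stock Abot (swap in your own SiteCrawler subclass if you have one); the start URL and the MaxPagesToCrawl value are placeholders, and constructor overloads differ between Abot versions:
// Sketch only: a config that allows image content types, handed to the crawler.
CrawlConfiguration config = new CrawlConfiguration();
config.DownloadableContentTypes = "text/html,application/json,text/plain,image/jpeg,image/pjpeg,*/*";
config.MaxPagesToCrawl = 100;   // placeholder setting

// Abot 1.x style: nulls mean "use the default implementation" (newer versions have simpler overloads)
var crawler = new PoliteWebCrawler(config, null, null, null, null, null, null, null, null);
crawler.PageCrawlCompletedAsync += crawler_ProcessPageCrawlCompleted;
crawler.Crawl(new Uri("http://example.com/"));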
Solution 1
Create a LinkSelector class inheriting from HapHyperLinkParser and pass it to the SiteCrawler constructor.
In LinkSelector, override GetHrefValues. Extract the image URLs from the downloaded page and include them in the returned list.
Save the images in your crawler_ProcessPageCrawlCompleted handler by referring to crawledPage.Content.Bytes.
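A rough sketch of such a LinkSelector, under the assumption that HapHyperLinkParser exposes GetHrefValues(CrawledPage) as an overridable member (verify against your Abot version):
using System.Collections.Generic;
using System.Linq;
using Abot.Core;   // HapHyperLinkParser
using Abot.Poco;   // CrawledPage

// Sketch only: return the usual anchor hrefs plus every <img src> value,
// so the crawler schedules the images for download as well.
public class LinkSelector : HapHyperLinkParser
{
    protected override IEnumerable<string> GetHrefValues(CrawledPage crawledPage)
    {
        List<string> values = base.GetHrefValues(crawledPage).ToList();

        var imgNodes = crawledPage.HtmlDocument.DocumentNode.SelectNodes("//img[@src]");
        if (imgNodes != null)
            values.AddRange(imgNodes
                .Select(n => n.GetAttributeValue("src", null))
                .Where(src => !string.IsNullOrEmpty(src)));

        return values;
    }
}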
Solution 2
Extract the image URLs in your crawler_ProcessPageCrawlCompleted handler and add them to your crawler's scheduler like this:
e.CrawlContext.Scheduler.Add(new PageToCrawl(new Uri(pictureUrl)));
Your images will be downloaded the same way as any other HTML page.
Save the images in your crawler_ProcessPageCrawlCompleted handler by referring to crawledPage.Content.Bytes.
In either case you can tell whether a crawled item is a page or an image by, for example, its URL.
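A hedged sketch of what that handler could look like for Solution 2 (the output folder, the content-type check, and the relative-URL resolution are my own illustrative choices; adjust them to however you prefer to tell pages and images apart):
// Requires: using System.IO; using Abot.Crawler; using Abot.Poco;
void crawler_ProcessPageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
{
    CrawledPage crawledPage = e.CrawledPage;
    string contentType = crawledPage.HttpWebResponse != null
        ? crawledPage.HttpWebResponse.ContentType
        : "";

    if (contentType.StartsWith("image/"))
    {
        // The crawler fetched an image: write the raw bytes to disk
        string fileName = Path.GetFileName(crawledPage.Uri.LocalPath);
        File.WriteAllBytes(Path.Combine(@"C:\TESTCRAWL\IMAGES", fileName), crawledPage.Content.Bytes);
    }
    else if (crawledPage.HtmlDocument != null)
    {
        // The crawler fetched an HTML page: queue every <img src> it references
        var imgNodes = crawledPage.HtmlDocument.DocumentNode.SelectNodes("//img[@src]");
        if (imgNodes == null) return;

        foreach (var img in imgNodes)
        {
            string src = img.GetAttributeValue("src", "");
            if (!string.IsNullOrEmpty(src))
                e.CrawlContext.Scheduler.Add(new PageToCrawl(new Uri(crawledPage.Uri, src)));
        }
    }
}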
Benefits
There are significant benefits to using your crawler instead of a separate downloader.
If the website requires a login before you can download anything, you can establish the session for the crawler and not worry about opening another one. Some websites prevent multiple logins for the same user, too.
Also, you need to be careful with separate downloaders and make sure they don't open a new connection for each image. I'd recommend creating a connection pool and reusing it; otherwise you can bring the server down.
My preference is still to use just the crawler.
Upvotes: 3
Reputation: 2064
Abot doesn't download images automatically; it's built to crawl web URLs. You need to write your own code to extract the image URLs and then loop through them all.
Step 1: Extract the image src values from the page source using HtmlAgilityPack
// Parse the crawled page's HTML and collect every <img> src value
List<string> imgSrcs = new List<string>();
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(crawledPage.Content.Text);
var nodes = doc.DocumentNode.SelectNodes(@"//img[@src]");
if (nodes != null)
{
    foreach (var img in nodes)
    {
        HtmlAttribute att = img.Attributes["src"];
        imgSrcs.Add(att.Value);
    }
}
Step 2: Loop through each src in the list and download the image to the C: drive
int i = 0;
// WebClient (System.Net) was not declared in the original snippet
using (WebClient client = new WebClient())
{
    foreach (string src in imgSrcs)
    {
        // Resolve relative src values against the crawled page's URI before downloading
        client.DownloadFile(new Uri(crawledPage.Uri, src), @"c:\temp\image_" + i + ".jpg");
        i++;
    }
}
Note: I'm using the "i" variable to give each image a unique name; otherwise the same file would be overwritten each time.
Upvotes: 3