Reputation: 530
I'm developing a crawler and I need to save some evidence that the crawler did its job.
I'm looking for a way to download all the HTML, CSS, and JS of a given URL and recreate the same folder structure as the target site.
I'll have to use Azure Functions to run the crawler.
The idea is to scrape a site, download the content, and save it in an Azure Blob.
I found this article about it, but it only shows how to download the HTML, and I need to reproduce exactly what the crawler saw (with images, CSS, and processed JS).
I believe all the absolute paths will work; the real problem is the relative paths, for which I will have to create folders to save the files.
Can someone help me?
Upvotes: 1
Views: 557
Reputation: 530
Well, I believe this answer can be helpful to those who have gone through the same thing as I did.
My solution was to download the HTML (using HttpWebRequest) and write it to a file (stored in Azure Blob Storage).
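For context, the download-and-store part looks roughly like this. This is a minimal sketch, assuming the classic WindowsAzure.Storage SDK; the container name, blob name, and connection string are placeholders, not part of my actual setup:

// Minimal sketch of the download + blob upload steps.
// Assumes the classic WindowsAzure.Storage NuGet package.
using System.IO;
using System.Net;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

private static string DownloadHtml(string url)
{
    var request = (HttpWebRequest)WebRequest.Create(url);
    using (var response = (HttpWebResponse)request.GetResponse())
    using (var reader = new StreamReader(response.GetResponseStream()))
    {
        return reader.ReadToEnd();
    }
}

private static void SaveToBlob(string connectionString, string html)
{
    var account = CloudStorageAccount.Parse(connectionString);
    var client = account.CreateCloudBlobClient();
    var container = client.GetContainerReference("crawler-evidence"); // placeholder name
    container.CreateIfNotExists();
    var blob = container.GetBlockBlobReference("page.html"); // placeholder name
    blob.UploadText(html);
}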
In my case, I wrote a function to correct all the relative paths in the HTML file, as below:
// Requires the HtmlAgilityPack NuGet package, plus:
// using System.Linq;
// using System.Text.RegularExpressions;
// using HtmlAgilityPack;
private static HtmlDocument CorrectHTMLReferencies(string urlRoot, string htmlContent)
{
    HtmlDocument document = new HtmlDocument();
    document.LoadHtml(htmlContent);

    // Matches links that already contain a host name (i.e., absolute URLs)
    Regex rx = new Regex(@"([\w-]+\.)+[\w-]+(\/[\w- .\/?%&=]*)?");

    var nodesIMG = document.DocumentNode.SelectNodes("//img");
    var nodesCSS = document.DocumentNode.SelectNodes("//link");
    var nodesJS  = document.DocumentNode.SelectNodes("//script");

    // Keep the scheme of the root URL for protocol-relative links ("//host/path")
    string protocol = "http:";
    if (urlRoot.Contains(":"))
        protocol = urlRoot.Split(':')[0] + ":";

    void WatchURl(HtmlNodeCollection colNodes, string attr)
    {
        if (colNodes == null) // SelectNodes returns null when nothing matches
            return;

        foreach (HtmlNode node in colNodes)
        {
            if (node.Attributes.Any(a => a.Name?.ToLower() == attr.ToLower()))
            {
                string link = node.Attributes[attr].Value;
                if (rx.IsMatch(link))
                {
                    // Absolute but protocol-relative: prepend the scheme
                    if (link.StartsWith("//"))
                    {
                        string novaUrl = protocol + link;
                        node.SetAttributeValue(attr, novaUrl);
                    }
                }
                else
                {
                    // Relative path: resolve it against the root URL
                    node.SetAttributeValue(attr, urlRoot + link);
                }
            }
        }
    }

    WatchURl(nodesIMG, "src");   // images
    WatchURl(nodesCSS, "href");  // stylesheets
    WatchURl(nodesJS,  "src");   // scripts

    return document;
}
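Putting it together, usage looks something like this (hypothetical: DownloadHtml and SaveToBlob are the sketch functions above, and the URL is just an example):

string html = DownloadHtml("https://example.com/");
HtmlDocument fixedDoc = CorrectHTMLReferencies("https://example.com/", html);
SaveToBlob(connectionString, fixedDoc.DocumentNode.OuterHtml);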
Instead of downloading the whole website, I download only one file. It works (for me) ;)
Upvotes: 2