Reputation: 2402
I am trying to download files from a website. My current solution seems to work, but there are some things I don't understand.
The first issue is with this XPath:
//div[@class='large-4 medium-4 columns']//a
There are other divs with the class large-4 medium-4 columns, so I am getting a couple of unnecessary links. How can I get rid of them? I only need links to pages that contain /products/.
The second issue is that nothing gets downloaded to C:\temp\, and I suspect the problem is in this XPath:
//div[@class='large-6 medium-8 columns large-centered']/a[string-length(@href)>0]
But what is wrong with it?
"xxx" is the link in my code and it should be
Imports HtmlAgilityPack
Module Module1
    Sub Main()
        Dim mainUrl As String = "xxx"
        Dim htmlDoc As HtmlDocument = New HtmlWeb().Load(mainUrl) ' Load the web page into an HtmlDocument
        Dim listLinks As New List(Of String)
        Dim srcs As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes("//div[@class='large-4 medium-4 columns']//a") ' Select nodes with links
        For Each src As HtmlNode In srcs
            ' Store links in the list
            listLinks.Add(src.Attributes("href").Value)
            Console.WriteLine(src.Attributes("href").Value)
        Next
        Console.Read()
        For Each productLink As String In listLinks
            Dim prodDoc As HtmlDocument = New HtmlWeb().Load(productLink)
            Dim scrapedsrcs As HtmlNodeCollection = prodDoc.DocumentNode.SelectNodes("//div[@class='large-6 medium-8 columns large-centered']/a[string-length(@href)>0]") ' Select nodes with links
            If scrapedsrcs IsNot Nothing Then
                For Each scrapedlink As HtmlNode In scrapedsrcs
                    ' Show links in console
                    'Console.WriteLine($"-- {scrapedlink.Attributes("href").Value}") ' Print URLs
                    My.Computer.Network.DownloadFile(scrapedlink.Attributes("href").Value, "C:\temp\" & System.IO.Path.GetFileName(scrapedlink.Attributes("href").Value) & ".pdf")
                Next
            End If
        Next
        Console.Read()
        ' End of scraping
    End Sub
End Module
EDIT:
OK, the XPath for the first issue should be:
//div[@class='row inset1 productItem padb1 padt1']/div[@class='large-4 medium-4 columns']//a
Upvotes: 0
Views: 448
Reputation: 1173
This will download the brochures to the folder the app is run from:
' Requires Imports HtmlAgilityPack and Imports System.Net at the top of the file
' Load the product list page and collect the links to the individual product pages
Dim htmlDoc As HtmlDocument = New HtmlWeb().Load("https://webpage.com")
Dim ProductListPage As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes("//div[@class='productContain padb6']//div[@class='large-4 medium-4 columns']/a")
For Each src As HtmlNode In ProductListPage
    ' Load each product page and look for its download links
    htmlDoc = New HtmlWeb().Load(src.Attributes("href").Value)
    Dim LinkTester As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes("//div[@class='row padt6 padb4']//a")
    If LinkTester IsNot Nothing Then
        For Each dllink In LinkTester
            Dim LinkURL As String = dllink.Attributes("href").Value
            Console.WriteLine(LinkURL)
            ' Take everything after the last "/" as the file name
            Dim ExtractFilename As String = LinkURL.Substring(LinkURL.LastIndexOf("/") + 1)
            Dim DLClient As New WebClient
            ' DownloadFileAsync returns immediately; the download continues in the background
            DLClient.DownloadFileAsync(New Uri(LinkURL), ".\" & ExtractFilename)
        Next
    End If
Next
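If you would rather filter on the link itself than on the surrounding div classes, the XPath contains() function can keep only the anchors whose href includes /products/. Below is a minimal sketch of that idea, not a drop-in fix. The helper names GetProductLinks and DownloadTo are made up for illustration, the inline HTML stands in for the real page, and DownloadTo assumes the href is an absolute URL that already ends in the file name. It uses the blocking DownloadFile, so a console app does not exit before the file is written.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Net
Imports HtmlAgilityPack

Module ProductLinkSketch
    ' XPath 1.0 contains() keeps only anchors whose href includes "/products/".
    Private Const ProductLinkXPath As String = "//div[@class='large-4 medium-4 columns']//a[contains(@href, '/products/')]"

    ' Hypothetical helper: returns the href of every matching anchor.
    Function GetProductLinks(doc As HtmlDocument) As List(Of String)
        Dim links As New List(Of String)
        ' SelectNodes returns Nothing (not an empty collection) when nothing matches.
        Dim nodes As HtmlNodeCollection = doc.DocumentNode.SelectNodes(ProductLinkXPath)
        If nodes IsNot Nothing Then
            For Each node As HtmlNode In nodes
                links.Add(node.GetAttributeValue("href", String.Empty))
            Next
        End If
        Return links
    End Function

    ' Hypothetical helper: blocking download into targetFolder.
    ' Assumes fileUrl is absolute; New Uri throws on a relative href.
    Sub DownloadTo(fileUrl As String, targetFolder As String)
        Dim fileUri As New Uri(fileUrl)
        ' Use the last path segment as the file name, without appending an extra extension.
        Dim fileName As String = Path.GetFileName(fileUri.LocalPath)
        Directory.CreateDirectory(targetFolder)
        Using client As New WebClient()
            client.DownloadFile(fileUri, Path.Combine(targetFolder, fileName))
        End Using
    End Sub

    Sub Main()
        ' Inline HTML used only to demonstrate the filter; no network access needed.
        Dim doc As New HtmlDocument()
        doc.LoadHtml("<div class='large-4 medium-4 columns'><a href='/products/a'>A</a></div>" &
                     "<div class='large-4 medium-4 columns'><a href='/about'>B</a></div>")
        For Each link As String In GetProductLinks(doc)
            Console.WriteLine(link)
        Next
    End Sub
End Module
```

With the inline HTML above, only the /products/a link should pass the filter. If the real page uses relative hrefs, they would need to be combined with the site's base address before being passed to DownloadTo.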
Upvotes: 1