Reputation: 810
I found this question and answer and tried to implement it in my code but it doesn't seem to work.
F#.Data HTML Parser Extracting Strings From Nodes
I tried what the asker tried in his question and see no output when I print the results. I also tried one of the recommended implementations, and it prints nothing either:
let links =
    results.Descendants("td")
    |> Seq.filter (fun x -> x.HasClass("pagenav"))
    |> Seq.collect (fun x -> x.Elements("a"))
    |> Seq.map (fun y -> y.AttributeValue("href"))
    |> Seq.toList
I am successfully retrieving the web page and I can even print the HTML so I know that part is working. My code is as follows:
open System
open FSharp.Data
[<EntryPoint>]
let main (args: string[]) =
    let htmlPage = HtmlDocument.Load("https://scrapethissite.com/")
    printfn "%s" (string htmlPage) // I know it is getting the html

    // The asker of the original question stated this printed out the links,
    // but it just prints <null> for me
    let links1 =
        htmlPage.Descendants("td")
        |> Seq.filter (fun x -> x.HasClass("pagenav"))
        |> Seq.map (fun x -> x.Elements("a"))
        |> Seq.iter (fun x -> x |> Seq.iter (fun y -> y.AttributeValue("href") |> printf "%A"))
    printfn "Links1 : %A" links1

    // A combination of attempts to get it to print just something from the html and no luck, just empty.
    let links =
        HtmlDocument.elementsNamed ["a"] htmlPage
        //htmlPage.Elements("a")
        //htmlPage.Descendants("td")
        //|> Seq.filter (fun x -> x.HasClass("pagenav"))
        //|> Seq.collect (fun x -> x.Elements("a"))
        //|> Seq.map (fun y -> y.AttributeValue("href"))
        |> Seq.toList
    printfn "Links: %A" links

    Console.ReadKey() |> ignore
    0 // return an integer exit code
Any help would be appreciated. Thanks.
Upvotes: 0
Views: 306
Reputation: 52798
Assuming you want the links from https://scrapethissite.com/, you need to look at the HTML of its navigation links and find a pattern that matches them.
Looking at the source of the page shows:
<li id="nav-homepage" class="active">
<a href="/" class="nav-link hidden-sm hidden-xs">
<img src="/static/images/scraper-icon.png" id="nav-logo">
Scrape This Site
</a>
</li>
That is the first navigation link across the top.
Looking at the other buttons, I see a similar pattern:
<a href="/pages/" class="nav-link">
<i class="glyphicon glyphicon-console hidden-sm hidden-xs"></i>
Sandbox
</a>
Each of the navigation links has the class nav-link, which you can search for.
So taking the original suggestion you are working from and modifying it like so should work:
htmlPage.Descendants("a")
|> Seq.filter (fun x -> x.HasClass("nav-link"))
|> Seq.iter (fun x -> x.AttributeValue("href") |> printfn "%A")
This looks for all the <a> elements on the entire page, filters them down to the ones with the class nav-link, and then prints the href value of each.
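If you want the links as a value rather than just printed, here is a minimal sketch of the same filter that collects the href values into a list. It assumes an F# script run with dotnet fsi, and it parses a small inline HTML string (sampleHtml, my stand-in modeled on the snippets above, with one extra hypothetical link that lacks the class) instead of loading the live page:

```fsharp
#r "nuget: FSharp.Data"
open FSharp.Data

// Stand-in for the real page so the sketch runs without a network call.
// The first two links mirror the snippets above; the third is hypothetical.
let sampleHtml = """
<html><body>
  <li id="nav-homepage" class="active">
    <a href="/" class="nav-link hidden-sm hidden-xs">Scrape This Site</a>
  </li>
  <li><a href="/pages/" class="nav-link">Sandbox</a></li>
  <li><a href="/other/">No nav-link class here</a></li>
</body></html>"""

let doc = HtmlDocument.Parse sampleHtml

// Every <a> in the document, kept only if it carries the nav-link class,
// then reduced to its href attribute value.
let navLinks =
    doc.Descendants("a")
    |> Seq.filter (fun a -> a.HasClass("nav-link"))
    |> Seq.map (fun a -> a.AttributeValue("href"))
    |> Seq.toList

printfn "Links: %A" navLinks
```

Ending the pipeline with Seq.toList instead of Seq.iter matters here: Seq.iter returns unit, so binding its result to a name and printing that name with %A will never show the links.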
Different sites will have different HTML, and depending on how strict you want to be you might try different approaches. For example, I could have looked for a ul with a class of nav and then just pulled the links from there.
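A rough sketch of that stricter approach, again as a dotnet fsi script. The inline HTML here is hypothetical (I have not copied the real list markup from the site), so treat the ul class and structure as assumptions to check against the actual page source:

```fsharp
#r "nuget: FSharp.Data"
open FSharp.Data

// Hypothetical markup: a nav list plus one link outside it.
let sampleHtml = """
<html><body>
<ul class="nav">
  <li><a href="/">Scrape This Site</a></li>
  <li><a href="/pages/">Sandbox</a></li>
</ul>
<p><a href="/elsewhere/">A link outside the nav list</a></p>
</body></html>"""

// Find each <ul> with the nav class first, then pull only the <a>
// elements nested inside it, so links elsewhere on the page are ignored.
let navListLinks =
    HtmlDocument.Parse(sampleHtml).Descendants("ul")
    |> Seq.filter (fun ul -> ul.HasClass("nav"))
    |> Seq.collect (fun ul -> ul.Descendants("a"))
    |> Seq.map (fun a -> a.AttributeValue("href"))
    |> Seq.toList

printfn "Links: %A" navListLinks
```

This version does not depend on each link having its own class, but it does depend on the container, so it breaks if the site renames or restructures the list.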
Usually it is just a case of looking at the source and discerning a pattern for how to find the things you are looking for.
Upvotes: 2