Cogito Ergo Sum

Reputation: 810

F# Data: How to get all navigation links from a website?

I found this question and answer and tried to implement it in my code but it doesn't seem to work.

F#.Data HTML Parser Extracting Strings From Nodes

I tried both the approach from the asker's question and one of the recommended implementations, and neither prints anything. This is the recommended implementation I tried:

let links = 
    results.Descendants("td")
    |> Seq.filter (fun x -> x.HasClass("pagenav"))
    |> Seq.collect (fun x -> x.Elements("a"))
    |> Seq.map (fun y -> y.AttributeValue("href"))
    |> Seq.toList

I am successfully retrieving the web page and I can even print the HTML so I know that part is working. My code is as follows:

open System
open System.IO
open FSharp.Data

[<EntryPoint>]
let main (args: string[]) =      
    let htmlPage = HtmlDocument.Load("https://scrapethissite.com/")
    printfn "%s" (string htmlPage) // I know it is getting the html

    // The asker of the original question stated this printed out the links, but it just prints <null>
    // for me
    let links1 = 
        htmlPage.Descendants("td")
        |> Seq.filter (fun x -> x.HasClass("pagenav"))
        |> Seq.map (fun x -> x.Elements("a"))
        |> Seq.iter (fun x -> x |> Seq.iter (fun y -> y.AttributeValue("href") |> printf "%A"))
    
    
    printfn "Links1 : %A" links1

    // A combination of attempts to get it to print just something from the html and no luck, just empty.
    let links = 
        HtmlDocument.elementsNamed ["a"] htmlPage
        //htmlPage.Elements("a")
        //htmlPage.Descendants("td")
        //|> Seq.filter (fun x -> x.HasClass("pagenav"))
        //|> Seq.collect (fun x -> x.Elements("a"))
        //|> Seq.map (fun y -> y.AttributeValue("href"))
        |> Seq.toList

    printfn "Links: %A" links

    Console.ReadKey() |> ignore
    0 // return an integer exit code

Any help would be appreciated. Thanks.

Upvotes: 0

Views: 306

Answers (1)

DaveShaw

Reputation: 52798

Assuming you want the links from https://scrapethissite.com/, you need to look at the HTML of its navigation links and find a pattern that selects them.

Looking at the source of the page shows:

<li id="nav-homepage" class="active">
  <a href="/" class="nav-link hidden-sm hidden-xs">
    <img src="/static/images/scraper-icon.png" id="nav-logo">
    Scrape This Site
  </a>
</li>

That is the markup for the first navigation link across the top.

Looking at the other buttons I see a similar pattern of:

<a href="/pages/" class="nav-link">
  <i class="glyphicon glyphicon-console hidden-sm hidden-xs"></i>
  Sandbox
</a>

Each of the navigation links has the class nav-link, which you can search for.

So taking the original suggestion you are working from and modifying it like so should work:

htmlPage.Descendants("a")
|> Seq.filter (fun x -> x.HasClass("nav-link"))
// After the filter each x is a single <a> node, so read its href directly
|> Seq.iter (fun x -> x.AttributeValue("href") |> printfn "%A")

This finds every <a> element on the entire page, keeps only those with the class nav-link, and prints the href value of each.
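To try the same idea without hitting the network, here is a minimal sketch that runs the filter over an inline HTML string and collects the href values into a list instead of printing them. The sample markup is trimmed from the snippets quoted above, the extra link without a class is added only for contrast, and the names sampleHtml and navHrefs are made up for illustration. It assumes the FSharp.Data package is available.

```fsharp
// Script-only directive; in a compiled project, reference the FSharp.Data package instead.
#r "nuget: FSharp.Data"
open FSharp.Data

// Trimmed-down sample of the navigation markup (an assumption, not the full page).
let sampleHtml = """
<html><body>
  <ul>
    <li id="nav-homepage" class="active">
      <a href="/" class="nav-link hidden-sm hidden-xs">Scrape This Site</a>
    </li>
    <li>
      <a href="/pages/" class="nav-link">Sandbox</a>
    </li>
  </ul>
  <a href="/other/">A link without the nav-link class</a>
</body></html>"""

let doc = HtmlDocument.Parse(sampleHtml)

// Every <a> in the document, narrowed to those carrying the nav-link class,
// then mapped to the value of the href attribute.
let navHrefs =
    doc.Descendants("a")
    |> Seq.filter (fun a -> a.HasClass("nav-link"))
    |> Seq.map (fun a -> a.AttributeValue("href"))
    |> Seq.toList

printfn "%A" navHrefs
```

Returning a list rather than printing inside Seq.iter also avoids the problem of binding a unit value and then trying to print it.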

Different sites will have different HTML, and depending on how strict you want to be you might try different approaches. I could have looked for a ul with a class of nav, for example, then just pulled the links from there.
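As a rough sketch of that alternative, the following looks for a ul with the class nav and then pulls every link beneath it. The markup here is hypothetical (the real page may be structured differently), and navHtml and hrefsInNav are illustrative names. It uses Descendants rather than Elements on the ul because the links sit inside li elements, not directly under the list.

```fsharp
// Script-only directive; in a compiled project, reference the FSharp.Data package instead.
#r "nuget: FSharp.Data"
open FSharp.Data

// Hypothetical markup: a nav list plus one unrelated link outside it.
let navHtml = """
<html><body>
  <ul class="nav">
    <li><a href="/">Scrape This Site</a></li>
    <li><a href="/pages/">Sandbox</a></li>
  </ul>
  <p><a href="/elsewhere/">A link outside the nav list</a></p>
</body></html>"""

let hrefsInNav =
    HtmlDocument.Parse(navHtml).Descendants("ul")
    |> Seq.filter (fun ul -> ul.HasClass("nav"))
    // Descendants searches the whole subtree, so links nested in <li> are found.
    |> Seq.collect (fun ul -> ul.Descendants("a"))
    |> Seq.map (fun a -> a.AttributeValue("href"))
    |> Seq.toList

printfn "%A" hrefsInNav
```

This version does not depend on each link having its own class, but it does depend on the list container keeping its class name.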

Usually it is just a case of looking at the source and discerning a pattern for how to find the things you are looking for.

Upvotes: 2
