yanta
yanta

Reputation: 841

Is this correct use of pattern matching and active patterns?

I'm new to F# and functional and am working on some HTML parsing code. I want to remove from a HTML document elements that match some criteria. Here I have a sequence of objects (HtmlNodes) and want to remove them from the document.

Is this idiomatic way of using pattern matching? Also as HtmlNode.Remove() has a side-effect on the original HtmlDocument object, is there any particular way of structuring the code to make the side-effect obvious or how should this be handled. You can be as pedantic as you like with the code.

open HtmlAgilityPack

let removeNodes (node : HtmlNode) = 

    let (|HiddenNodeCount|) (n : HtmlNode) =
        match n.SelectNodes("*[@style[contains(.,'visibility:hidden')]]") with
        | null -> 0
        | _ as x -> Seq.length x

    match node with
        | x when x.Name.ToLower() = "script" -> node.Remove() 
        | x when x.NodeType = HtmlNodeType.Comment -> node.Remove()
        | HiddenNodeCount x when x > 0 -> node.Remove()
        | _ -> ()

let html = "some long messy html code would be here"
let dom = new HtmlDocument(OptionAutoCloseOnEnd=true)
dom.LoadHtml(html)

let nodes = dom.DocumentNode.DescendantNodes()
nodes |> Seq.toArray |> Array.iter removeNodes

Upvotes: 1

Views: 344

Answers (2)

Stephen Swensen
Stephen Swensen

Reputation: 22297

Personally, I prefer if elif else over pattern matching when you don't have a data structure to decompose (it's just less typing, and may also serve to differentiate between when a structure is being decomposed versus simpler case testing).

There are some odd things in your code. The Active Pattern isn't very helpful here for two reasons: first, its scope is limited to removeNodes so it is only used once. I'll address the second issues later, but first I will show how I would write this by eliminating the Active Pattern and, for me at least, making the side-effects more obvious (by separating the code which tests whether a node should be removed from the code that does the removing):

let shouldRemoveNode (node : HtmlNode) = 
    if node.Name.ToLower() = "script" then true
    elif node.NodeType = HtmlNodeType.Comment then true
    else match node.SelectNodes("*[@style[contains(.,'visibility:hidden')]]") with
         | null -> false
         | x -> Seq.length x > 0

let removeNode (node: HtmlNode) = 
    if shouldRemoveNode(node) then node.Remove() else ()

Notice I do use a pattern match in the visibility hidden query since I do get to match against null and bind to x otherwise (rather than binding to x, and then testing x with if else).

The second odd thing with your Active Pattern is that you are using it for converting a node to an int, but the length you obtain isn't immediately useful (you still need to perform a test against it). Whereas the more powerful use of an Active Pattern here would be to carve up nodes into different kinds (assuming this isn't ad-hoc, which was may first point). So you could have:

//expand to encompass several other kinds of nodes
let (|Script|Comment|Hidden|Other|) (node : HtmlNode) = 
    if node.Name.ToLower() = "script" then Script
    elif node.NodeType = HtmlNodeType.Comment then Comment
    else match node.SelectNodes("*[@style[contains(.,'visibility:hidden')]]") with
         | null -> Other
         | x -> if Seq.length x > 0 then Hidden
                else Other

let removeNode (node: HtmlNode) = 
    match node with
    | Script | Comment | Hidden -> node.Remove()
    | Other -> ()

Edit:

@Pascal made the observation in the comments that shouldRemoveNode can be further condensed into one big boolean expression:

let shouldRemoveNode (node : HtmlNode) = 
    node.Name.ToLower() = "script" || 
    node.NodeType = HtmlNodeType.Comment ||
    match node.SelectNodes("*[@style[contains(.,'visibility:hidden')]]") with
    | null -> false
    | x -> Seq.length x > 0

Upvotes: 1

Brian
Brian

Reputation: 118865

It isn't clear to me that this is any better than using functions and if-then-else, e.g.

let HiddenNodeCount (n : HtmlNode) = 
    match n.SelectNodes("*[@style[contains(.,'visibility:hidden')]]") with 
    | null -> 0 
    | x -> Seq.length x 

if node.Name.ToLower() = "script" then 
    node.Remove()  
elif node.NodeType = HtmlNodeType.Comment then
    node.Remove()  
elif HiddenNodeCount node > 0 then 
    node.Remove()  

Upvotes: 0

Related Questions