Reputation: 841
I'm new to F# and functional and am working on some HTML parsing code. I want to remove from a HTML document elements that match some criteria. Here I have a sequence of objects (HtmlNodes) and want to remove them from the document.
Is this idiomatic way of using pattern matching? Also as HtmlNode.Remove() has a side-effect on the original HtmlDocument object, is there any particular way of structuring the code to make the side-effect obvious or how should this be handled. You can be as pedantic as you like with the code.
open HtmlAgilityPack
let removeNodes (node : HtmlNode) =
let (|HiddenNodeCount|) (n : HtmlNode) =
match n.SelectNodes("*[@style[contains(.,'visibility:hidden')]]") with
| null -> 0
| _ as x -> Seq.length x
match node with
| x when x.Name.ToLower() = "script" -> node.Remove()
| x when x.NodeType = HtmlNodeType.Comment -> node.Remove()
| HiddenNodeCount x when x > 0 -> node.Remove()
| _ -> ()
let html = "some long messy html code would be here"
let dom = new HtmlDocument(OptionAutoCloseOnEnd=true)
dom.LoadHtml(html)
let nodes = dom.DocumentNode.DescendantNodes()
nodes |> Seq.toArray |> Array.iter removeNodes
Upvotes: 1
Views: 344
Reputation: 22297
Personally, I prefer if elif else
over pattern matching when you don't have a data structure to decompose (it's just less typing, and may also serve to differentiate between when a structure is being decomposed versus simpler case testing).
There are some odd things in your code. The Active Pattern isn't very helpful here for two reasons: first, its scope is limited to removeNodes so it is only used once. I'll address the second issues later, but first I will show how I would write this by eliminating the Active Pattern and, for me at least, making the side-effects more obvious (by separating the code which tests whether a node should be removed from the code that does the removing):
let shouldRemoveNode (node : HtmlNode) =
if node.Name.ToLower() = "script" then true
elif node.NodeType = HtmlNodeType.Comment then true
else match node.SelectNodes("*[@style[contains(.,'visibility:hidden')]]") with
| null -> false
| x -> Seq.length x > 0
let removeNode (node: HtmlNode) =
if shouldRemoveNode(node) then node.Remove() else ()
Notice I do use a pattern match in the visibility hidden query since I do get to match against null and bind to x otherwise (rather than binding to x, and then testing x with if else
).
The second odd thing with your Active Pattern is that you are using it for converting a node to an int, but the length you obtain isn't immediately useful (you still need to perform a test against it). Whereas the more powerful use of an Active Pattern here would be to carve up nodes into different kinds (assuming this isn't ad-hoc, which was may first point). So you could have:
//expand to encompass several other kinds of nodes
let (|Script|Comment|Hidden|Other|) (node : HtmlNode) =
if node.Name.ToLower() = "script" then Script
elif node.NodeType = HtmlNodeType.Comment then Comment
else match node.SelectNodes("*[@style[contains(.,'visibility:hidden')]]") with
| null -> Other
| x -> if Seq.length x > 0 then Hidden
else Other
let removeNode (node: HtmlNode) =
match node with
| Script | Comment | Hidden -> node.Remove()
| Other -> ()
Edit:
@Pascal made the observation in the comments that shouldRemoveNode
can be further condensed into one big boolean expression:
let shouldRemoveNode (node : HtmlNode) =
node.Name.ToLower() = "script" ||
node.NodeType = HtmlNodeType.Comment ||
match node.SelectNodes("*[@style[contains(.,'visibility:hidden')]]") with
| null -> false
| x -> Seq.length x > 0
Upvotes: 1
Reputation: 118865
It isn't clear to me that this is any better than using functions and if-then-else, e.g.
let HiddenNodeCount (n : HtmlNode) =
match n.SelectNodes("*[@style[contains(.,'visibility:hidden')]]") with
| null -> 0
| x -> Seq.length x
if node.Name.ToLower() = "script" then
node.Remove()
elif node.NodeType = HtmlNodeType.Comment then
node.Remove()
elif HiddenNodeCount node > 0 then
node.Remove()
Upvotes: 0