Sean Clark Hess

Reputation: 16059

How would you express this in Haskell?

Would you write this algorithm in Haskell with if/else? Is there a way to express it without nested conditionals? It's hard to extract meaningful functions from the middle of the tree, because it is just the output of a machine learning system.

I'm implementing the algorithm for classifying segments of HTML content as Content or Boilerplate described here. The weights in the tree below are already hard-coded.

curr_linkDensity <= 0.333333
| prev_linkDensity <= 0.555556
| | curr_numWords <= 16
| | | next_numWords <= 15
| | | | prev_numWords <= 4: BOILERPLATE
| | | | prev_numWords > 4: CONTENT
| | | next_numWords > 15: CONTENT
| | curr_numWords > 16: CONTENT
| prev_linkDensity > 0.555556
| | curr_numWords <= 40
| | | next_numWords <= 17: BOILERPLATE
| | | next_numWords > 17: CONTENT
| | curr_numWords > 40: CONTENT
curr_linkDensity > 0.333333: BOILERPLATE

Upvotes: 5

Views: 178

Answers (2)

luqui

Reputation: 60463

If you'd rather not simplify the logic by hand (say, because you might generate this code automatically), I think using MultiWayIf is pretty clean and direct.

{-# LANGUAGE MultiWayIf #-}

data Stats = Stats {
    curr_linkDensity :: Double,
    prev_linkDensity :: Double,
    ...
}

data Classification = Content | Boilerplate

classify :: Stats -> Classification
classify s = if
    | curr_linkDensity s <= 0.333333 -> if
      | prev_linkDensity s <= 0.555556 -> if
        | curr_numWords s <= 16 -> if
          | next_numWords s <= 15 -> if
            | prev_numWords s <= 4 -> Boilerplate
            | prev_numWords s > 4 -> Content
          | next_numWords s > 15 -> Content
      ...

and so on.
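For reference, here is a minimal sketch of what the fully expanded version might look like, transcribed from the tree in the question. The three word-count fields and their Double types are assumptions, since the record above elides them. Using `otherwise` for each "greater than" branch is a choice made here so that every multi-way if is total.

```haskell
{-# LANGUAGE MultiWayIf #-}

-- Assumed record: the word-count fields and their Double types are
-- guesses made for this sketch, not taken from the original code.
data Stats = Stats
  { curr_linkDensity :: Double
  , prev_linkDensity :: Double
  , curr_numWords    :: Double
  , prev_numWords    :: Double
  , next_numWords    :: Double
  }

data Classification = Content | Boilerplate
  deriving (Eq, Show)

-- Each `otherwise` stands in for the complementary "greater than" branch
-- of the tree, so no multi-way if can fall through at runtime.
classify :: Stats -> Classification
classify s = if
  | curr_linkDensity s <= 0.333333 -> if
    | prev_linkDensity s <= 0.555556 -> if
      | curr_numWords s <= 16 -> if
        | next_numWords s <= 15 -> if
          | prev_numWords s <= 4 -> Boilerplate
          | otherwise            -> Content
        | otherwise -> Content
      | otherwise -> Content
    | otherwise -> if
      | curr_numWords s <= 40 -> if
        | next_numWords s <= 17 -> Boilerplate
        | otherwise             -> Content
      | otherwise -> Content
  | otherwise -> Boilerplate
```

Loaded in GHCi, `classify (Stats 0.5 0 10 10 10)` should give `Boilerplate`, because the current link density exceeds 0.333333.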

However, since this is so structured -- just a tree of if/else with comparisons, also consider creating a decision tree data structure and writing an interpreter for it. This will allow you to do transformations, manipulations, inspections. Maybe it will buy you something; defining miniature languages for your specifications can be surprisingly beneficial.

data DecisionTree i o 
    = Comparison (i -> Double) Double (DecisionTree i o) (DecisionTree i o)
    | Leaf o

runDecisionTree :: DecisionTree i o -> i -> o
runDecisionTree (Comparison f v ifLess ifGreater) i
    | f i <= v  = runDecisionTree ifLess i
    | otherwise = runDecisionTree ifGreater i
runDecisionTree (Leaf o) _ = o

-- DecisionTree is an encoding of a function, and you can write
-- Functor, Applicative, and Monad instances!
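To make the "transformations and inspections" point concrete, here is a small sketch of a Functor instance plus two inspection helpers. The names `leaves`, `depth`, and `toy` are hypothetical and exist only for this illustration; the tree type and interpreter are repeated so the block stands alone.

```haskell
-- A comparison node tests `f input <= threshold`; a leaf carries the result.
data DecisionTree i o
  = Comparison (i -> Double) Double (DecisionTree i o) (DecisionTree i o)
  | Leaf o

runDecisionTree :: DecisionTree i o -> i -> o
runDecisionTree (Comparison f v ifLessEq ifGreater) i
  | f i <= v  = runDecisionTree ifLessEq i
  | otherwise = runDecisionTree ifGreater i
runDecisionTree (Leaf o) _ = o

-- Transformation: relabel every leaf without touching the comparisons.
instance Functor (DecisionTree i) where
  fmap g (Comparison f v l r) = Comparison f v (fmap g l) (fmap g r)
  fmap g (Leaf o)             = Leaf (g o)

-- Inspection: all leaf outputs, left to right.
leaves :: DecisionTree i o -> [o]
leaves (Comparison _ _ l r) = leaves l ++ leaves r
leaves (Leaf o)             = [o]

-- Inspection: the largest number of comparisons any input can trigger.
depth :: DecisionTree i o -> Int
depth (Comparison _ _ l r) = 1 + max (depth l) (depth r)
depth (Leaf _)             = 0

-- Hypothetical toy tree over a plain Double input, for illustration only.
toy :: DecisionTree Double String
toy =
  Comparison id 10
    (Leaf "small")
    (Comparison id 100 (Leaf "medium") (Leaf "large"))
```

With this in scope, `leaves toy` lists the three labels in order, and `runDecisionTree (fmap length toy) 50` runs the same comparisons but returns the length of the label instead of the label itself.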

Then

 classifier :: DecisionTree Stats Classification
 classifier =
     Comparison curr_linkDensity 0.333333
       (Comparison prev_linkDensity 0.555556
         (Comparison curr_numWords 16
           (Comparison next_numWords 15
             (Comparison prev_numWords 4
               (Leaf Boilerplate)
               (Leaf Content))
             (Leaf Content)
           ...
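Written out in full against the tree in the question, the data-structure version might look like the sketch below. As before, the word-count fields of Stats and their Double types are assumptions (they need to be Double to fit the `i -> Double` accessor), and the definitions are repeated so the block compiles on its own.

```haskell
-- Assumed record: field names beyond the two link densities, and all
-- Double types for the word counts, are guesses made for this sketch.
data Stats = Stats
  { curr_linkDensity :: Double
  , prev_linkDensity :: Double
  , curr_numWords    :: Double
  , prev_numWords    :: Double
  , next_numWords    :: Double
  }

data Classification = Content | Boilerplate
  deriving (Eq, Show)

data DecisionTree i o
  = Comparison (i -> Double) Double (DecisionTree i o) (DecisionTree i o)
  | Leaf o

runDecisionTree :: DecisionTree i o -> i -> o
runDecisionTree (Comparison f v ifLessEq ifGreater) i
  | f i <= v  = runDecisionTree ifLessEq i
  | otherwise = runDecisionTree ifGreater i
runDecisionTree (Leaf o) _ = o

-- First subtree of each Comparison is the "<=" branch, second is ">".
classifier :: DecisionTree Stats Classification
classifier =
  Comparison curr_linkDensity 0.333333
    (Comparison prev_linkDensity 0.555556
      (Comparison curr_numWords 16
        (Comparison next_numWords 15
          (Comparison prev_numWords 4
            (Leaf Boilerplate)
            (Leaf Content))
          (Leaf Content))
        (Leaf Content))
      (Comparison curr_numWords 40
        (Comparison next_numWords 17
          (Leaf Boilerplate)
          (Leaf Content))
        (Leaf Content)))
    (Leaf Boilerplate)

-- Convenience wrapper so callers need not know about the tree.
classify :: Stats -> Classification
classify = runDecisionTree classifier
```

The nesting mirrors the indentation of the original tree dump one-to-one, which is what would make generating this value from the machine learning output fairly mechanical.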

Upvotes: 11

that other guy

Reputation: 123410

Since there are just three paths in this decision tree that lead to a BOILERPLATE state, I'd just enumerate and simplify them:

-- One disjunct per BOILERPLATE path in the tree.
isBoilerplate s =
     prev_linkDensity s <= 0.555556 && curr_numWords s <= 16
       && next_numWords s <= 15 && prev_numWords s <= 4
  || prev_linkDensity s > 0.555556 && curr_numWords s <= 40 && next_numWords s <= 17
  || curr_linkDensity s > 0.333333
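A hand-flattened condition is easy to get subtly wrong, so it may be worth checking it against the tree on inputs that straddle every threshold. The sketch below does that. The Stats record (with all fields as Double), `treeSaysBoilerplate`, `samples`, and `agree` are hypothetical names introduced for this illustration. The flat condition here keeps every test on each BOILERPLATE path, including `next_numWords <= 15` on the first one.

```haskell
-- Assumed record; field types are guesses made for this sketch.
data Stats = Stats
  { curr_linkDensity :: Double
  , prev_linkDensity :: Double
  , curr_numWords    :: Double
  , prev_numWords    :: Double
  , next_numWords    :: Double
  }

-- Guard-by-guard reading of the tree from the question.
treeSaysBoilerplate :: Stats -> Bool
treeSaysBoilerplate s
  | curr_linkDensity s > 0.333333 = True
  | prev_linkDensity s <= 0.555556 =
      curr_numWords s <= 16 && next_numWords s <= 15 && prev_numWords s <= 4
  | otherwise =
      curr_numWords s <= 40 && next_numWords s <= 17

-- Flat version: one disjunct per BOILERPLATE path.
isBoilerplate :: Stats -> Bool
isBoilerplate s =
     prev_linkDensity s <= 0.555556 && curr_numWords s <= 16
       && next_numWords s <= 15 && prev_numWords s <= 4
  || prev_linkDensity s > 0.555556 && curr_numWords s <= 40
       && next_numWords s <= 17
  || curr_linkDensity s > 0.333333

-- Values chosen to fall on both sides of every threshold in the tree.
samples :: [Stats]
samples =
  [ Stats cl pl cw pw nw
  | cl <- [0.2, 0.5]
  , pl <- [0.4, 0.7]
  , cw <- [10, 16, 17, 40, 41]
  , pw <- [4, 5]
  , nw <- [15, 16, 17, 18]
  ]

-- True when the flat condition and the tree agree on every sample.
agree :: Bool
agree = all (\s -> isBoilerplate s == treeSaysBoilerplate s) samples
```

This only checks a grid of representative values rather than proving equivalence, but each comparison in the tree is exercised on both sides, which is enough to catch a dropped or mistyped condition.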

Upvotes: 5
