wizzup

Reputation: 2411

Problem parsing adjacent blocks of tags with scalpel

I am having trouble using scalpel to capture a block of adjacent tags.

Given the following HTML snippet, stored in testS :: String:

<body>
  <h2>Apple</h2>
  <p>I Like Apple</p>
  <p>Do you like Apple?</p>

  <h2>Banana</h2>
  <p>I Like Banana</p>
  <p>Do you like Banana?</p>

  <h2>Carrot</h2>
  <p>I Like Carrot</p>
  <p>Do you like Carrot?</p>
</body>

I want to parse each group of one h2 and the two p tags that follow it as a single record, Block.

{-# LANGUAGE OverloadedStrings #-}

import Control.Monad
import Text.HTML.Scalpel

data Block = B String String String
  deriving Show

block :: Scraper String Block
block = do
  h  <- text $ "h2"
  pa <- text $ "p"
  pb <- text $ "p"
  return $ B h pa pb

blocks :: Scraper String [Block]
blocks = chroot "body" $ replicateM 3 block

But the result of scraping is not what I want. It looks like it keeps capturing the first block repeatedly and never consumes it.

λ> traverse (mapM_ print) $ scrapeStringLike testS blocks
B "Apple" "I Like Apple" "I Like Apple"
B "Apple" "I Like Apple" "I Like Apple"
B "Apple" "I Like Apple" "I Like Apple"

Expected output:

B "Apple" "I Like Apple" "Do you like Apple?"
B "Banana" "I Like Banana" "Do you like Banana?"
B "Carrot" "I Like Carrot" "Do you like Carrot?"

How can I make this work?

Upvotes: 1

Views: 350

Answers (2)

fimad

Reputation: 346

This is now supported in version 0.6.0 of scalpel through the use of SerialScrapers. SerialScrapers allow you to focus on one child of the current root at a time and expose APIs to move the focus and execute Scrapers on the currently focused node.

Adapting the example code in the documentation to your HTML gives:

-- Copyright 2019 Google LLC.
-- SPDX-License-Identifier: Apache-2.0

-- Chroot to the body tag and start a SerialScraper context with inSerial.
-- This will allow for focusing each child of body.
--
-- many applies the subsequent logic repeatedly until it no longer matches
-- and returns the results as a list.
chroot "body" $ inSerial $ many $ do
   -- Move the focus forward until text can be extracted from an h2 tag.
   title <- seekNext $ text "h2"
   -- Create a new SerialScraper context that contains just the tags between
   -- the current focus and the next h2 tag. Then until the end of this new 
   -- context, move the focus forward to the next p tag and extract its text.
   ps <- untilNext (matches "h2") (many $ seekNext $ text "p")
   return (title, ps)

Which would return:

[
  ("Apple", ["I Like Apple", "Do you like Apple?"]),
  ("Banana", ["I Like Banana", "Do you like Banana?"]),
  ("Carrot", ["I Like Carrot", "Do you like Carrot?"])
]
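
For completeness, here is a minimal sketch of how the snippet above might be fitted to the Block record from the question. It assumes scalpel 0.6.0 or later and that each h2 is followed by exactly two p tags; the case expression that rejects any other shape is my own addition, not something taken from the scalpel documentation.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Applicative (empty, many)
import Text.HTML.Scalpel

data Block = B String String String
  deriving (Show, Eq)

-- The HTML from the question, flattened into one String.
testS :: String
testS = unlines
  [ "<body>"
  , "  <h2>Apple</h2>"
  , "  <p>I Like Apple</p>"
  , "  <p>Do you like Apple?</p>"
  , "  <h2>Banana</h2>"
  , "  <p>I Like Banana</p>"
  , "  <p>Do you like Banana?</p>"
  , "  <h2>Carrot</h2>"
  , "  <p>I Like Carrot</p>"
  , "  <p>Do you like Carrot?</p>"
  , "</body>"
  ]

blocks :: Scraper String [Block]
blocks = chroot "body" $ inSerial $ many $ do
  -- Advance the focus to the next h2 and take its text.
  h  <- seekNext $ text "h2"
  -- Collect the text of every p between here and the following h2.
  ps <- untilNext (matches "h2") (many $ seekNext $ text "p")
  -- Assumption: a block is valid only if it has exactly two p tags.
  case ps of
    [pa, pb] -> return (B h pa pb)
    _        -> empty
```

In GHCi, traverse (mapM_ print) $ scrapeStringLike testS blocks should then print one Block per heading. Because a malformed group makes the inner scraper fail, many stops at that point and returns only the blocks collected so far.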

Upvotes: 1

trevor cook

Reputation: 1600

First, I apologize for proposing a solution without testing or knowing anything about scalpel (such arrogance). Let me make it up to you; here's my totally rewritten attempt.

First, this monstrosity works.

blocks :: Scraper String [Block]
blocks = chroot "body" $ do
  hs <- texts "h2"
  ps <- texts "p"
  return $ combine hs ps
  where
    combine (h:hs) (p:p':ps) = B h p p' : combine hs ps
    combine _ _ = []

I call it a monstrosity because it erases the structure of the document with the two texts calls and then recreates it in the assumed order via combine. This probably isn't a big deal in practice, though, since most pages group related tags inside a <div>.

So, if we were to have a different page:

testS' :: String
testS' = unlines [ "<body>",
              "<div>",
              "  <h2>Apple</h2>",
              "  <p>I Like Apple</p>",
              "  <p>Do you like Apple?</p>",
              "</div>",
              "",
              "<div>",
              "  <h2>Banana</h2>",
              "  <p>I Like Banana</p>",
              "  <p>Do you like Banana?</p>",
              "",
              "</div>",
              "<div>",
              "  <h2>Carrot</h2>",
              "  <p>I Like Carrot</p>",
              "  <p>Do you like Carrot?</p>",
              "</div>",
              "</body>"
              ]

Then we can parse via:

block' :: Scraper String Block
block' = do
  h  <- text $ "h2"
  [pa,pb] <- texts $ "p"
  return $ B h pa pb

blocks' :: Scraper String [Block]
blocks' = chroots ("body" // "div") $ block'

Yielding,

B "Apple" "I Like Apple" "Do you like Apple?"
B "Banana" "I Like Banana" "Do you like Banana?"
B "Carrot" "I Like Carrot" "Do you like Carrot?"

Edit: re >>= and combine

My combine, above, is a local where definition. What you see there is what you get. It's unrelated to the function used in >>=, which incidentally is also a locally defined function with a slightly different name, combined. Even if they had the same name, however, it wouldn't matter, since each is only in scope within its respective function.

As for >>=, and going just by the observed behavior, each scrape starts from the beginning of the currently selected tags. So in your block definition, chroot "body" selects all the tags in the body, text "h2" matches the first <h2>, and the two text "p" calls both match the first <p>. The bind therefore acts like an "and": given the scalpel context of a bunch of tags, match an <h2> and a <p> and (redundantly) a <p>. Notice that in my <div>-based parse I could use texts (note the "s") to get the two <p> tags I was expecting.
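
A small sketch of that behavior, using a made-up two-paragraph snippet (the names snippet, twice, and allPs are only for illustration):

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Text.HTML.Scalpel

-- Hypothetical input with two sibling <p> tags.
snippet :: String
snippet = "<body><p>first</p><p>second</p></body>"

-- Two sequential text "p" scrapes: each starts from the same set of tags,
-- so both return the first <p>.
twice :: Scraper String (String, String)
twice = chroot "body" $ do
  a <- text "p"
  b <- text "p"
  return (a, b)

-- texts "p" returns every match instead of only the first.
allPs :: Scraper String [String]
allPs = chroot "body" $ texts "p"
```

Here scrapeStringLike snippet twice gives Just ("first","first"), while scrapeStringLike snippet allPs gives Just ["first","second"].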

Finally, this behavior clicked for me when I saw it was based on tag soup (and, at the same moment, why they named it tag soup). Each of these scrapes is like dipping a spoon into an unordered soup of tags. The selector makes the soup; the scraper is your spoon. Hope that helps.

Upvotes: 1
