Reputation: 31

Can we use Xidel for extracting data across a site into search files?

Background: We are aggregating content from some websites (with permission) for use in supplementary search functions for another application. An example is the news section of https://centenary.bahai.us. We thought to use xidel for this purpose, since the template file paradigm seems an elegant way to extract data from html, e.g. for a template:

<h1 class="title">{$title}</h1>?
<div class="node build-mode-full">
  {$url:=$url}
  <div class="field-field-audio">?
    <audio src="{$audio:='https://' || $host || .}"></audio>?
  </div>?
  <div class="field-field-clip-img">
    <a href="{$image:='https://' || $host || .}" class="imagefield-field_clip_img"></a>*
  </div>?
  <div class="field-field-pubname">{$publication}</div>?
  <div class="field-field-historical-date">{$date}</div>?
  <div class="location"><div class="adr">{$location}</div>?</div>?
  <div class="node-body">{$text}</div>
</div>?

...we can run a command like the following:

xidel "https://centenary.bahai.us" -e "$(< template.html)" -f "//a[contains(@href, '/news/')]" --silent --color=never --output-format=json-wrapped > index.json

...which will give us json formatted data from all the news pages on centenary.bahai.us. An example article would look like this:

{
"title": "Bahá’ísm the Religion of Brotherhood", 
"url": "https://centenary.bahai.us/news/bahaism-religion-brotherhood", 
"audio": "https://centenary.bahai.us/sites/default/files/453_0.mp3", 
"image": "https://centenary.bahai.us/sites/default/files/imagecache/lightbox-large/images/press_clippings/03-31-1912_NYT_Bahaism_the_Religion_of_Brotherhood.png", 
"publication": "The New York Times", 
"date": "March 31, 1912", 
"location": "New York, NY", 
"text": "A posthumous volume of “Essays in Radical Empiricism,” by William James, will be published in April by Longmans, Green & Co. This house will also bring out “Leo XIII, and Anglican Orders,” by Viscount Halifax, and “Bahá’ísm, the Religion of Brotherhood, and Its Place in the Evolution of Creeds,” by Francis H. Skrine. In the latter an analysis is made of the Gospel of Bahá’u’lláh and his successor. ‘Abdu’l-Bahá — whose arrival in this country is expected early in April — and a forecast is attempted of its influence on civilization."
},

That's just beautiful, and a whole lot easier than some mishmash of httrack and pup or (God forbid) sed and regex, but there are a few issues:

We would like to have separate files for each document, whereas this gives us one large json file.
Even with the --silent flag, we still get status messages in the output which invalidate the json, such as **** Retrieving (GET): https://centenary.bahai.us **** or **** Processing: https://centenary.bahai.us/ **** or ** Current variable state: **
The process seems too fragile; if there is any discrepancy between the template and the actual html, the entire process errors out and we get nothing. We would prefer to have it output an error for only the one page and then continue with the next URL.

Xidel seems like a game-changing tool which should make this job possible with a one-line command and a simple extract template file; what am I missing here?

Upvotes: 3

Answers (2)

Reino

Reputation: 3443

Judging from the use of $(< template.html) I guess you're on a Linux distro. In that case your quoting is wrong. See #9 and #10 here.

Since you're using an extraction-template-file I would've said --extract-file=template.html would be the parameter to use, but your -e "$(< template.html)" appears to work too. That's new for me. Thanks.
And thanks to BeniBela's answer I've come to know -e @template.html works as well.

Next is the wrong order of your parameters. I have to admit, Xidel's readme isn't all too clear on this.
Right after xidel should come --silent --color=never, and you'd obviously have to "follow" an url first before you can do an extraction. So this should work:

$ xidel --silent --color=never "https://centenary.bahai.us" \
  -f '//div[@class="views-field-title"]/span/a[starts-with(@href,"/news/")]/@href' \
  --extract-file=template.html \
  --output-format=json-wrapped \
  > index.json

I hardly ever use templates myself, so I would've done things a bit differently by building the json myself:

$ xidel -s "https://centenary.bahai.us" -e '
  for $x in //div[@class="views-field-title"]/span/a[starts-with(@href,"/news/")]/@href return
  file:write(
    substring-after($x,"/news/")||".json",
    doc($x)/{
      "title"://h1/text(),
      "url":resolve-uri($x),
      "audio"://audio/resolve-uri(@src),
      "image"://div[ends-with(@class,"clip-img")]//img/resolve-uri(@src),
      "publication"://div[ends-with(@class,"pubname")]/div/normalize-space(div[@class="field-item odd"]),
      "date"://div[ends-with(@class,"historical-date")]//span/text(),
      "location"://span[@class="locality"]/text(),
      "text":string-join(//div[@class="node-body"]//text())
    },
    {"method":"json","indent":true()}
  )
'

//div[@class="views-field-title"]/span/a[starts-with(@href,"/news/")]/@href returns the current news article relative paths:

/news/visit-abdul-baha-abbas
/news/abdul-baha-prays-ascension-church
/news/bahaist-leader-here-interest-world-peace
/news/abdul-baha-abbas-coming-lewis-g-gregory

For every news article the url is opened, the html-source parsed and the extracted information saved as an indented/prettified JSON file. The first one, visit-abdul-baha-abbas.json, for instance:

{
  "title": "A Visit to ‘Abdu’l-Bahá Abbas",
  "url": "https:\/\/centenary.bahai.us\/news\/visit-abdul-baha-abbas",
  "audio": "https:\/\/centenary.bahai.us\/sites\/default\/files\/257_0.mp3",
  "image": "https:\/\/centenary.bahai.us\/sites\/default\/files\/imagecache\/page-secondary-images\/images\/press_clippings\/04-17-1912%20Utica%20NY%20Press%20A%20Visit%20to%20Abdul%20Baha%20Abbas.png",
  "publication": "Utica New York Press",
  "date": "April 17, 1912",
  "location": "Acca",
  "text": "An American Girl Tells of a Memorable Experience in Her Life.[...]"
}

Upvotes: 1

BeniBela

Reputation: 16927

You can build your own output in Xidel.

You can save JSON to a file using the file module in XQuery, e.g.:

file:write-text("/tmp/test.json", serialize-json({"a": 1}))

You can pass any JSON to serialize-json

 { "title": $title, "url": $url, "audio": $audio}

Or build it from a list of variables:

 {| ("title", "url", "audio") ! {.: get(.)}  |}

There are multiple approaches to catch errors:

Call Xidel multiple times in bash

The old school way:

xidel --silent "https://centenary.bahai.us/news" -e "//a[contains(@href, '/news/')]/resolve-html(.)"  | sort | uniq | 
while read -r u; do 
  xidel $u -e @template.html >> output.json
done

Use a multipage template

You can put the template pattern in a template xml file to obtain greater control than on the command line:

  <action>
    <page url="https://centenary.bahai.us/news"/>
    <s>$temp := distinct-values( //a[contains(@href, '/news/')]/resolve-html(.) )</s>

    <loop var="u" list="$temp">
      <page url="{$u}"/>
      <try>
        <pattern><body>
          <h1 class="title">{$title}</h1>?
          <div class="node build-mode-full">
            {$url:=$url}
            <div class="field-field-audio">?
              <audio src="{$audio:='https://' || $host || .}"></audio>?
            </div>?
            <div class="field-field-clip-img">
              <a href="{$image:='https://' || $host || .}" class="imagefield-field_clip_img"></a>*
            </div>?
            <div class="field-field-pubname">{$publication}</div>?
            <div class="field-field-historical-date">{$date}</div>?
            <div class="location"><div class="adr">{$location}</div>?</div>?
            <div class="node-body">{$text}</div>
          </div>?
        </body></pattern>
        <catch>
        </catch>
      </try>
    </loop>

  </action>

<loop> repeats something and <try><catch> catches all errors.

And then call it with xidel --template-file action.xml

Call the pattern matching from XQuery

You can also call the pattern matching from XQuery using the function pxp:match, although there you cnanot use the special built-in variables (like $url or $host. btw avoid modifying those variables, not sure what happens if you do. ):

 xidel  'https://centenary.bahai.us/news' --xquery '
    let $pattern := doc("file://./template.html")
    for $u in distinct-values( //a[contains(@href, "/news/")]/resolve-html(.) )
    return try {
      pxp:match($pattern, doc($u))
    } catch * { 
       "error"
    }
 '

Upvotes: 0

Can we use Xidel for extracting data across a site into search files?

Answers (2)

Call Xidel multiple times in bash

Use a multipage template

Call the pattern matching from XQuery

Related Questions