Reputation: 31
Background: We are aggregating content from some websites (with permission) for use in supplementary search functions for another application. An example is the news section of https://centenary.bahai.us. We thought to use xidel for this purpose, since the template file paradigm seems an elegant way to extract data from html, e.g. for a template:
<h1 class="title">{$title}</h1>?
<div class="node build-mode-full">
{$url:=$url}
<div class="field-field-audio">?
<audio src="{$audio:='https://' || $host || .}"></audio>?
</div>?
<div class="field-field-clip-img">
<a href="{$image:='https://' || $host || .}" class="imagefield-field_clip_img"></a>*
</div>?
<div class="field-field-pubname">{$publication}</div>?
<div class="field-field-historical-date">{$date}</div>?
<div class="location"><div class="adr">{$location}</div>?</div>?
<div class="node-body">{$text}</div>
</div>?
...we can run a command like the following:
xidel "https://centenary.bahai.us" -e "$(< template.html)" -f "//a[contains(@href, '/news/')]" --silent --color=never --output-format=json-wrapped > index.json
...which will give us json formatted data from all the news pages on centenary.bahai.us. An example article would look like this:
{
"title": "Bahá’ísm the Religion of Brotherhood",
"url": "https://centenary.bahai.us/news/bahaism-religion-brotherhood",
"audio": "https://centenary.bahai.us/sites/default/files/453_0.mp3",
"image": "https://centenary.bahai.us/sites/default/files/imagecache/lightbox-large/images/press_clippings/03-31-1912_NYT_Bahaism_the_Religion_of_Brotherhood.png",
"publication": "The New York Times",
"date": "March 31, 1912",
"location": "New York, NY",
"text": "A posthumous volume of “Essays in Radical Empiricism,” by William James, will be published in April by Longmans, Green & Co. This house will also bring out “Leo XIII, and Anglican Orders,” by Viscount Halifax, and “Bahá’ísm, the Religion of Brotherhood, and Its Place in the Evolution of Creeds,” by Francis H. Skrine. In the latter an analysis is made of the Gospel of Bahá’u’lláh and his successor. ‘Abdu’l-Bahá — whose arrival in this country is expected early in April — and a forecast is attempted of its influence on civilization."
},
That's just beautiful, and a whole lot easier than some mishmash of httrack and pup or (God forbid) sed and regex, but there are a few issues:
Even with the --silent flag, we still get status messages in the output which invalidate the JSON, such as **** Retrieving (GET): https://centenary.bahai.us ****, **** Processing: https://centenary.bahai.us/ **** or ** Current variable state: **
Xidel seems like a game-changing tool which should make this job possible with a one-line command and a simple extract template file; what am I missing here?
Upvotes: 3
Views: 1250
Reputation: 3443
Judging from the use of $(< template.html) I guess you're on a Linux distro. In that case your quoting is wrong. See #9 and #10 here.
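One common reason quoting matters in a Linux shell (an assumption about what the linked points cover) is that double quotes let the shell expand $names before xidel ever sees them, while single quotes pass them through untouched. A minimal sketch, with made-up variable names:

```shell
# Make sure no shell variable named "title" is set.
unset title

# Double quotes: the shell replaces $title (here with nothing).
double_quoted="<h1>{$title}</h1>"

# Single quotes: the text is passed through literally.
single_quoted='<h1>{$title}</h1>'

printf '%s\n' "$double_quoted"   # <h1>{}</h1>
printf '%s\n' "$single_quoted"   # <h1>{$title}</h1>
```

Text produced by a command substitution such as "$(< template.html)" is not expanded a second time, which is probably why that form still works.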
Since you're using an extraction-template-file I would've said --extract-file=template.html would be the parameter to use, but your -e "$(< template.html)" appears to work too. That's new for me. Thanks.
And thanks to BeniBela's answer I've come to know -e @template.html works as well.
Next is the wrong order of your parameters. I have to admit, Xidel's readme isn't all too clear on this.
Right after xidel should come --silent --color=never, and you'd obviously have to "follow" a URL first before you can do an extraction. So this should work:
$ xidel --silent --color=never "https://centenary.bahai.us" \
-f '//div[@class="views-field-title"]/span/a[starts-with(@href,"/news/")]/@href' \
--extract-file=template.html \
--output-format=json-wrapped \
> index.json
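As a rough sanity check that the reordered command really keeps status messages out of the file, something like the following could be run afterwards. It is only a sketch: has_status_lines is a hypothetical helper name, and the pattern is inferred from the three message shapes quoted in the question.

```shell
# Hypothetical helper: succeeds (exit 0) if the given file contains lines
# shaped like xidel status messages, e.g. "**** Processing: ... ****"
# or "** Current variable state: **".
has_status_lines() {
  grep -Eq '^\*{2,} .* \*{2,}$' "$1"
}

# Only check when the output file actually exists.
if [ -f index.json ] && has_status_lines index.json; then
  echo "index.json still contains status messages" >&2
fi
```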
I hardly ever use templates myself, so I would've done things a bit differently by building the json myself:
$ xidel -s "https://centenary.bahai.us" -e '
for $x in //div[@class="views-field-title"]/span/a[starts-with(@href,"/news/")]/@href return
file:write(
substring-after($x,"/news/")||".json",
doc($x)/{
"title"://h1/text(),
"url":resolve-uri($x),
"audio"://audio/resolve-uri(@src),
"image"://div[ends-with(@class,"clip-img")]//img/resolve-uri(@src),
"publication"://div[ends-with(@class,"pubname")]/div/normalize-space(div[@class="field-item odd"]),
"date"://div[ends-with(@class,"historical-date")]//span/text(),
"location"://span[@class="locality"]/text(),
"text":string-join(//div[@class="node-body"]//text())
},
{"method":"json","indent":true()}
)
'
//div[@class="views-field-title"]/span/a[starts-with(@href,"/news/")]/@href returns the current news article relative paths:
/news/visit-abdul-baha-abbas
/news/abdul-baha-prays-ascension-church
/news/bahaist-leader-here-interest-world-peace
/news/abdul-baha-abbas-coming-lewis-g-gregory
The resulting visit-abdul-baha-abbas.json, for instance:
{
"title": "A Visit to ‘Abdu’l-Bahá Abbas",
"url": "https:\/\/centenary.bahai.us\/news\/visit-abdul-baha-abbas",
"audio": "https:\/\/centenary.bahai.us\/sites\/default\/files\/257_0.mp3",
"image": "https:\/\/centenary.bahai.us\/sites\/default\/files\/imagecache\/page-secondary-images\/images\/press_clippings\/04-17-1912%20Utica%20NY%20Press%20A%20Visit%20to%20Abdul%20Baha%20Abbas.png",
"publication": "Utica New York Press",
"date": "April 17, 1912",
"location": "Acca",
"text": "An American Girl Tells of a Memorable Experience in Her Life.[...]"
}
Upvotes: 1
Reputation: 16927
You can build your own output in Xidel.
You can save JSON to a file using the file module in XQuery, e.g.:
file:write-text("/tmp/test.json", serialize-json({"a": 1}))
You can pass any JSON to serialize-json:
{ "title": $title, "url": $url, "audio": $audio}
Or build it from a list of variables:
{| ("title", "url", "audio") ! {.: get(.)} |}
There are multiple approaches to catch errors:
The old school way:
xidel --silent "https://centenary.bahai.us/news" -e "//a[contains(@href, '/news/')]/resolve-html(.)" | sort | uniq |
while read -r u; do
  # Quote "$u" so the URL is passed as a single argument
  xidel "$u" -e @template.html >> output.json
done
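If one file per article is preferred over appending everything to output.json, the file name can be derived from the URL in the shell, mirroring the substring-after($x,"/news/")||".json" idea from the other answer. A minimal sketch; article_filename is a hypothetical helper name:

```shell
# Hypothetical helper: strip everything up to and including "/news/"
# from an article URL and append ".json".
article_filename() {
  printf '%s.json\n' "${1##*/news/}"
}

article_filename "https://centenary.bahai.us/news/visit-abdul-baha-abbas"
# prints: visit-abdul-baha-abbas.json
```

Inside the loop, the redirect would then presumably become > "$(article_filename "$u")" instead of >> output.json.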
You can put the template pattern in a template xml file to obtain greater control than on the command line:
<action>
<page url="https://centenary.bahai.us/news"/>
<s>$temp := distinct-values( //a[contains(@href, '/news/')]/resolve-html(.) )</s>
<loop var="u" list="$temp">
<page url="{$u}"/>
<try>
<pattern><body>
<h1 class="title">{$title}</h1>?
<div class="node build-mode-full">
{$url:=$url}
<div class="field-field-audio">?
<audio src="{$audio:='https://' || $host || .}"></audio>?
</div>?
<div class="field-field-clip-img">
<a href="{$image:='https://' || $host || .}" class="imagefield-field_clip_img"></a>*
</div>?
<div class="field-field-pubname">{$publication}</div>?
<div class="field-field-historical-date">{$date}</div>?
<div class="location"><div class="adr">{$location}</div>?</div>?
<div class="node-body">{$text}</div>
</div>?
</body></pattern>
<catch>
</catch>
</try>
</loop>
</action>
<loop> repeats something and <try><catch> catches all errors.
And then call it with xidel --template-file action.xml
You can also call the pattern matching from XQuery using the function pxp:match, although there you cannot use the special built-in variables (like $url or $host; by the way, avoid modifying those variables, as I'm not sure what happens if you do):
xidel 'https://centenary.bahai.us/news' --xquery '
let $pattern := doc("file://./template.html")
for $u in distinct-values( //a[contains(@href, "/news/")]/resolve-html(.) )
return try {
pxp:match($pattern, doc($u))
} catch * {
"error"
}
'
Upvotes: 0