Gilles Quénot

Reputation: 185680

Use a variable as XPath expression. Not expected behavior

To parse reddit.com, I use

xidel -e '//div[@data-click-id="background"]/div[@data-adclicklocation="title"]/div/a[@data-click-id="body"]/@href|//div[@data-click-id="background"]/div[@data-adclicklocation="title"]/div/a[@data-click-id="body"]/div/h3/text()' "https://www.reddit.com/r/bash" 

The base XPath is repeated twice, so I decided to use a xidel variable:

xidel -se 'xp:=//div[@data-click-id="background"]/div[@data-adclicklocation="title"]/div/a[@data-click-id="body"]' \
    -e '$xp/@href|$xp/div/h3/text()' 'https://www.reddit.com/r/bash'

but the output differs from that of the previous command.

Bonus if someone can show how to join each URL and title with a space instead of \n. I tried fn:string-join() and fn:concat(), without success.

I also tried || " " ||, but that didn't give the expected "url <description>" line for each match.

Upvotes: 0

Views: 116

Answers (1)

Reino

Reputation: 3443

The output wouldn't differ if you had added --extract-exclude=xp. Please see my answer here, and the quote from the readme in particular.

What you're probably seeing:

xp := set -x is your friend
Homework questions.
Need some help with bash to combine two lists
Sshto update
Cannot pipe the output to a file
Worked a lot on this script lately

These are the text nodes selected by your XPath expression, printed as the value of xp. The variable does actually hold the element nodes, but --output-node-format=text is the default, after all.

However, you really don't need this kind of internal variable for situations like this. I personally only use them for exporting to system variables. If you want to use variables, use a FLWOR expression:

$ xidel -s "https://www.reddit.com/r/bash" -e '
  for $x in //div[@data-adclicklocation="title"]/div/a[@data-click-id="body"] return
  ($x/@href,$x/div/h3)
'

$ xidel -s "https://www.reddit.com/r/bash" -e '
  let $a:=//div[@data-adclicklocation="title"]/div/a[@data-click-id="body"] return
  $a/(@href,div/h3)
'

But the simplest query, without the need for variables, would probably be:

$ xidel -s "https://www.reddit.com/r/bash" -e '
  //div[@data-adclicklocation="title"]/div/a[@data-click-id="body"]/(@href,div/h3)
'
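If the only goal is to avoid typing the base path twice, another option (not one of the approaches above, just a sketch) is to let the shell do the substitution before xidel sees the expression. The names base, url, expr and run_query below are made up for illustration; the XPath and the -s / -e usage are the ones already shown in this answer.

```shell
#!/bin/sh
# Base path from the simplest query above, written once (hypothetical variable names).
base='//div[@data-adclicklocation="title"]/div/a[@data-click-id="body"]'
url='https://www.reddit.com/r/bash'

# The shell expands $base while building the string, so xidel receives
# one plain XPath expression and no xidel variable is involved.
expr="$base/concat(@href,\" \",div/h3)"

# Show the expression that would be passed to xidel.
printf '%s\n' "$expr"

# Hypothetical helper: runs the query (needs xidel installed and network access).
run_query() {
  xidel -s "$url" -e "$expr"
}
```

Calling run_query afterwards runs the query. Because nothing is assigned inside xidel, there is no "xp := ..." line to suppress. The trade-off is quoting: the expression has to be in double quotes for the shell to expand $base, so the inner double quotes need escaping.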

String-joining is as simple as:

-e '.../join((@href,div/h3))'
-e '.../concat(@href," ",div/h3)'
-e '.../(@href||" "||div/h3)'
-e '.../x"{@href} {div/h3}"'

With || don't forget the parentheses, or there's no context-item for div/h3.
The last one is Xidel's own extended-string-syntax.


Alternatively, you could parse the huge JSON, which surprisingly lists a lot more Reddit questions:

$ xidel -s "https://www.reddit.com/r/bash" -e '
  parse-json(
    extract(//script[@id="data"],"window.___r = (.+);",1)
  )//posts/models/*[not(isSponsored)]/join((permalink,title))
'

Upvotes: 2
