dmcontador

Reputation: 668

Retrieve a value from a JavaScript object with XPath

I need to extract information from HTML files. For most of them, I just need to match a particular DOM element's content or attribute, so I use XPath expressions like //a[@class="targeturl"]/@href with the command-line tool xidel.

In a different batch of files, the information I want is inside a script and is not so readily available:

<html>
<head><!-- ... --></head>
<body>
    ...
    <script>
        ...
        var o = {
            "numeric": 1234,
            "target": "TARGET",
            "urls": "http://example.com",
            // Commented pair "strings": "...",
            "arrays": [
               {
                  "more": true
               }
               ,
               { 
                  "itgoeson": true
               }
            ]
        };
    </script>
    ...
</body>
</html>

Note that the object containing the value I want is not valid JSON (it includes a // comment line). However, it does seem to keep to one key-value pair per line.

What can I pass to xidel --xpath "???" to get this TARGET?

I've tried different things with XPath functions, but I can't get to a solution without piping to other commands (matches only tells me yes/no, replace works line by line, etc.).

Upvotes: 2

Views: 1597

Answers (2)

Reino

Reputation: 3433

What can I pass to xidel --xpath "???" to get this TARGET?

Since the value assigned to var o is essentially JSON, I suggest you treat it as such:

-e "json(
      //script/extract(
        .,
        'var o = (.+);',
        1,'s'
      )[.]
    )/target"
  • Extract {"field1": 1234, "target": "TARGET", "morefields": "..."} from the <script> element node (the JSON spans several lines, so don't forget the 's' regex flag, which lets . match newlines).
  • Interpret the output as JSON by wrapping json( ) around it (or use //script/...[.] ! json(.)) and select the target property.

[edit]
To remove the comments (beginning with //):

-e "json(
      //script/replace(
        extract(
          .,
          'var o = (.+);',
          1,'s'
        )[.],
        '\s+//.+',
        ''
      )
    )/target"

Not the prettiest query, but it works.
[/edit]
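
For readers who want to check the logic outside xidel, here is a rough Python sketch of the same three steps (extract the object literal, strip the // comment lines, parse as JSON). It is not part of the original answer: the function name extract_target and the SAMPLE_SCRIPT string are illustrative assumptions, and the regular expressions are simply copied from the queries above.

```python
import json
import re

# Assumed sample: the <script> text from the question, with the "..." lines omitted.
SAMPLE_SCRIPT = '''
        var o = {
            "numeric": 1234,
            "target": "TARGET",
            "urls": "http://example.com",
            // Commented pair "strings": "...",
            "arrays": [
               {
                  "more": true
               }
               ,
               {
                  "itgoeson": true
               }
            ]
        };
'''


def extract_target(script_text):
    """Hypothetical helper mirroring extract() -> replace() -> json()."""
    # Step 1: capture everything between "var o = " and the last ";".
    # re.DOTALL plays the role of the 's' flag, so "." also matches newlines.
    match = re.search(r'var o = (.+);', script_text, re.DOTALL)
    if match is None:
        return None
    raw = match.group(1)

    # Step 2: remove "//" comments that are preceded by whitespace.
    # "http://" is left alone because no whitespace precedes its "//".
    cleaned = re.sub(r'\s+//.+', '', raw)

    # Step 3: parse the remainder as JSON and pick the "target" key.
    return json.loads(cleaned)["target"]


print(extract_target(SAMPLE_SCRIPT))
```

Because (.+) is greedy, the capture runs to the last ";" in the text, so a script with further statements after the object would probably need a tighter pattern.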

Upvotes: 1

Andersson

Reputation: 52675

Try the XPath below:

substring-before(substring-after(//script, '"target": "'), '"')

Upvotes: 1
