Reputation: 668
I need to extract information from HTML files. For most of them, I just need to match a particular DOM element's content or attribute, so I use XPath expressions like //a[@class="targeturl"]/@href with the command-line tool xidel.
In a different batch of files, the information I want is inside a <script> element and is not so readily available:
<html>
<head><!-- ... --></head>
<body>
...
<script>
...
var o = {
"numeric": 1234,
"target": "TARGET",
"urls": "http://example.com",
// Commented pair "strings": "...",
"arrays": [
{
"more": true
}
,
{
"itgoeson": true
}
]
};
</script>
...
</body>
</html>
Note that the object containing the value I want to get is not valid JSON. However, it seems to respect one key-value pair per line.
What can I pass to xidel --xpath "???" to get this TARGET?
I've tried different things with XPath functions, but I can't get to a solution without piping to other commands (match tells me yes/no, replace works line by line, etc.).
Upvotes: 2
Views: 1597
Reputation: 3433
What can I pass to xidel --xpath "???" to get this TARGET?
Since var o is actually JSON, I suggest you treat it as such:
-e "json(
//script/extract(
.,
'var o = (.+);',
1,'s'
)[.]
)/target"
This first extracts {"field1": 1234, "target": "TARGET", "morefields": "..."} from the <script> element node (the JSON covers several lines, so don't forget the 's' regex flag).
Then wrap json() around it (or use //script/...[.] ! json(.)) and select the target attribute.
[edit]
To remove the comments (beginning with //):
-e "json(
//script/replace(
extract(
.,
'var o = (.+);',
1,'s'
)[.],
'\s+//.+',
''
)
)/target"
Not the prettiest query, but it works.
[/edit]
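For anyone who wants to check the two regular expressions outside xidel, here is a rough Python sketch of the same three steps (extract, strip comments, parse). It is not how xidel works internally, only an illustration of the logic. SCRIPT_TEXT and extract_target are hypothetical names, and the sample text is copied from the script body in the question.

```python
import json
import re

# Hypothetical stand-in for the text content of the <script> element.
SCRIPT_TEXT = '''
var o = {
"numeric": 1234,
"target": "TARGET",
"urls": "http://example.com",
// Commented pair "strings": "...",
"arrays": [
{
"more": true
}
,
{
"itgoeson": true
}
]
};
'''


def extract_target(script_text):
    # Step 1: capture everything between "var o = " and the last ";".
    # re.DOTALL plays the role of the 's' flag: "." may cross newlines.
    match = re.search(r'var o = (.+);', script_text, re.DOTALL)
    if match is None:
        return None
    literal = match.group(1)
    # Step 2: drop "// ..." comments up to the end of their line.
    # Requiring whitespace before "//" leaves "http://example.com" alone.
    cleaned = re.sub(r'\s+//.+', '', literal)
    # Step 3: parse what is left as JSON and read the "target" key.
    return json.loads(cleaned)["target"]


print(extract_target(SCRIPT_TEXT))
```

The whitespace requirement in the comment pattern is what keeps the URL intact: in "http://" the two slashes follow a colon, not whitespace, so that line is left alone.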
Upvotes: 1
Reputation: 52675
Try the XPath below:
substring-before(substring-after(//script, '"target": '), ",")
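A rough way to see what this expression returns, sketched in Python rather than XPath. The two helpers are hypothetical stand-ins that imitate substring-after and substring-before for non-empty markers, and SCRIPT_TEXT is a shortened copy of the question's script. Note that the result still has the double quotes around the value, so they need to be stripped in a further step.

```python
def substring_after(text, marker):
    # Imitates XPath substring-after(): "" when the marker is absent.
    _, found, tail = text.partition(marker)
    return tail if found else ""


def substring_before(text, marker):
    # Imitates XPath substring-before(): "" when the marker is absent.
    head, found, _ = text.partition(marker)
    return head if found else ""


# Hypothetical, shortened stand-in for the <script> text content.
SCRIPT_TEXT = (
    'var o = {\n'
    '"numeric": 1234,\n'
    '"target": "TARGET",\n'
    '"urls": "http://example.com",\n'
    '};'
)

raw = substring_before(substring_after(SCRIPT_TEXT, '"target": '), ",")
print(raw)             # the value, still wrapped in double quotes
print(raw.strip('"'))  # the bare value
```

This approach depends on the exact spacing after the colon and on the value containing no comma, so it is more fragile than parsing the object.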
Upvotes: 1