Reputation: 17288
I have two web pages:
Page 1:
<data>
<item>
<name>Item 1</name>
<url>http://someUrl.html</url>
</item>
</data>
Page 2: http://someUrl.html
<data>
<info>Info 1</info>
<info>Info 2</info>
<info>Info 3</info>
</data>
I want to crawl page 1, follow all the links there, and generate the following output:
Item 1, Info 1
Item 1, Info 2
Item 1, Info 3
...
How can I achieve this using Xidel?
Upvotes: 1
Views: 1143
Reputation: 3433
You mention "all the links there", so instead of what you posted I'm going to assume this input:
<data>
<item>
<name>Item 1</name>
<url>http://someUrl1.html</url>
</item>
<item>
<name>Item 2</name>
<url>http://someUrl2.html</url>
</item>
<item>
<name>Item 3</name>
<url>http://someUrl3.html</url>
</item>
</data>
Linux:
xidel -s input.html -e 'for $item in //item for $info in doc($item/url)//info return $item/name||", "||$info'
#or
xidel -s input.html -e '
for $item in //item
for $info in doc($item/url)//info
return
$item/name||", "||$info
'
Windows:
xidel -s input.html -e "for $item in //item for $info in doc($item/url)//info return $item/name||', '||$info"
#or
xidel -s input.html -e ^"^
for $item in //item^
for $info in doc($item/url)//info^
return^
$item/name^|^|', '^|^|$info^
"
The first for-loop iterates over every <item> node. The second for-loop opens the URL and iterates over every <info> node. The return clause is a simple string concatenation.
The output in this case:
Item 1, Info 1
Item 1, Info 2
Item 1, Info 3
Item 2, Info 4
Item 2, Info 5
Item 2, Info 6
Item 3, Info 7
Item 3, Info 8
Item 3, Info 9
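For readers who don't know XQuery, the nested for-loops described above can be sketched in Python. This is only an illustration of the iteration logic, not how Xidel works internally. The PAGES dictionary, the load() helper, and the page contents are hypothetical in-memory stand-ins for the real pages, so nothing is fetched over the network:

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory stand-ins for the pages; a real crawler would
# fetch each URL instead of looking it up in this dictionary.
PAGES = {
    "input.html": """<data>
<item><name>Item 1</name><url>http://someUrl1.html</url></item>
<item><name>Item 2</name><url>http://someUrl2.html</url></item>
</data>""",
    "http://someUrl1.html": "<data><info>Info 1</info><info>Info 2</info></data>",
    "http://someUrl2.html": "<data><info>Info 3</info></data>",
}


def load(url):
    """Stand-in for doc(): parse the page stored under `url`."""
    return ET.fromstring(PAGES[url])


def crawl(start):
    """Pair every item name with every <info> found on the item's linked page."""
    rows = []
    for item in load(start).iter("item"):            # outer loop: every <item>
        name = item.findtext("name")
        linked_page = load(item.findtext("url"))     # follow the item's <url>
        for info in linked_page.iter("info"):        # inner loop: every <info>
            rows.append(f"{name}, {info.text}")      # same concatenation as the return clause
    return rows


if __name__ == "__main__":
    print("\n".join(crawl("input.html")))
```

Each output row joins the item name from the first page with one <info> value from the page it links to, which is what the return clause in the Xidel query does.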
Upvotes: 1
Reputation: 321
I recently found Xidel, so I'm no expert, but in my opinion it's an extremely powerful Swiss-army-knife command-line scraping tool that deserves to be known by many more people.
Now, to answer your question: I think the following (using HTML templates) does exactly what you want:
xidel -q page1.html --extract-exclude=name -e "<name>{name:=text()}</name>*" -f "<url>{link:=text()}</url>*" -e "<info>{string-join(($name, text()), ', ')}</info>*" --hide-variable-names
Or, even shorter with CSS selectors:
xidel -q page1.html --extract-exclude=name -e "name:=css('name')" -f "link:=css('url')" -e "css('info')/string-join(($name,.),', ')" --hide-variable-names
Or, shortest with XPath:
xidel -q page1.html --extract-exclude=name -e name:=//name -f link:=//url -e "//info/string-join(($name,.),', ')" --hide-variable-names
The shortest line possible (but not in CSV format) would be:
xidel -q page1.html -e //name,//info -f //url
The above commands are for Windows, so make sure to swap single and double quotes when on macOS/Linux! If you need an explanation of the different parts of these lines, just ask... :-) Cheers!
Upvotes: 1