Hemanshu Bhojak

Reputation: 17288

Collecting data from web sites

I have two web pages

Page 1:

<data>
<item>
<name>Item 1</name>
<url>http://someUrl.html</url>
</item>
</data>

Page 2: http://someUrl.html

<data>
<info>Info 1</info>
<info>Info 2</info>
<info>Info 3</info>
</data>

I want to crawl page 1, follow all the links there, and generate the following output:

Item 1, Info 1
Item 1, Info 2
Item 1, Info 3
...

How can I achieve this using Xidel?

Upvotes: 1

Views: 1143

Answers (2)

Reino

Reputation: 3433

You're talking about "all the links there", so instead of what you posted I'm going to assume the following input:

<data>
  <item>
    <name>Item 1</name>
    <url>http://someUrl1.html</url>
  </item>
  <item>
    <name>Item 2</name>
    <url>http://someUrl2.html</url>
  </item>
  <item>
    <name>Item 3</name>
    <url>http://someUrl3.html</url>
  </item>
</data>

Linux:

xidel -s input.html -e 'for $item in //item for $info in doc($item/url)//info return $item/name||", "||$info'
#or
xidel -s input.html -e '
  for $item in //item
  for $info in doc($item/url)//info
  return
  $item/name||", "||$info
'

Windows:

xidel -s input.html -e "for $item in //item for $info in doc($item/url)//info return $item/name||', '||$info"
#or
xidel -s input.html -e ^"^
  for $item in //item^
  for $info in doc($item/url)//info^
  return^
  $item/name^|^|', '^|^|$info^
"

The first for-loop iterates over every <item> node. The second for-loop opens the URL and iterates over every <info> node. The return clause is a simple string concatenation.
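If you prefer to bind the name to a variable once, the same idea can be written as a full XQuery FLWOR with a let clause. A minimal, untested sketch (assuming your Xidel build supports the --xquery switch; Linux quoting):

xidel -s input.html --xquery '
  for $item in //item
  let $name := $item/name
  for $info in doc($item/url)//info
  return $name || ", " || $info
'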

The output in this case:

Item 1, Info 1
Item 1, Info 2
Item 1, Info 3
Item 2, Info 4
Item 2, Info 5
Item 2, Info 6
Item 3, Info 7
Item 3, Info 8
Item 3, Info 9

Upvotes: 1

MatrixView

Reputation: 321

I recently found Xidel, so I'm no expert, but in my opinion it's an extremely powerful Swiss-army-knife command-line scraping tool that deserves to be far better known.

Now, to answer your question, I think the following (using HTML templates) does exactly what you want:

xidel -q page1.html --extract-exclude=name -e "<name>{name:=text()}</name>*" -f "<url>{link:=text()}</url>*" -e "<info>{string-join(($name, text()), ', ')}</info>*" --hide-variable-names

Or, even shorter with CSS selectors:

xidel -q page1.html --extract-exclude=name -e "name:=css('name')" -f "link:=css('url')" -e "css('info')/string-join(($name,.),', ')" --hide-variable-names

Or, shortest with XPath:

xidel -q page1.html --extract-exclude=name -e name:=//name -f link:=//url -e "//info/string-join(($name,.),', ')" --hide-variable-names

The shortest line possible (but not in CSV format) would be:

xidel -q page1.html -e //name,//info -f //url

The above commands are for Windows, so make sure to swap the single and double quotes when on macOS/Linux (see the sketch below). If you need an explanation of the different parts of these lines, just ask... :-) Cheers!
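For example, the XPath variant might look like this on macOS/Linux (an untested sketch; only the quoting changes):

xidel -q page1.html --extract-exclude=name -e 'name:=//name' -f 'link:=//url' -e '//info/string-join(($name,.),", ")' --hide-variable-names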

Upvotes: 1
