jhonitalia

Reputation: 69

xidel how to follow pagination html and extract URL?

On Windows 7, I'm using xidel in a batch file to test a website with pagination like this example:

link1

link2

link3

1 2 3 4 5 6 7 8 9 10 Next

I found a way to get the first 10 links:

xidel.exe https://www.website.es/search?q=xidel+follow+pagination^&start=0 --extract "//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']"

But when I try to follow to page 2 or page (n) with

-f "<A class="fl">{.}</A>"

or

--follow "//a/[@class='nav']"

nothing works. Can you give me some help or an example?

Thanks.

Upvotes: 1

Views: 1027

Answers (2)

Reino

Reputation: 3433

Search term in url query-string

With this simple query...

xidel "https://www.google.com/search?q=xidel+follow+pagination" -e "$url"
https://consent.google.com/ml?continue=[...]

...you'll notice we're hitting a cookie-wall. With -f "//form" Xidel can "click" on the consent-button.

Extract the urls:
xidel -s "https://www.google.com/search?q=xidel+follow+pagination" ^
      -f "//form" -e "//div[@class='egMi0 kCrYT']/a/@href"
/url?q=https://stackoverflow.com/questions/37262813/xidel-how-to-follow-pagination-html-and-extract-url&sa=U&ved=2ahUKEwjQ7eCblIL4AhXCjqQKHVOcCNoQFnoECAYQAg&usg=AOvVaw2Yyh9OVSR_FLKehWApnFK2
/url?q=https://stackoverflow.com/tags/xidel/hot%3Ffilter%3Dall&sa=U&ved=2ahUKEwjQ7eCblIL4AhXCjqQKHVOcCNoQFnoECAIQAg&usg=AOvVaw25MiKPwJB0jVHz2JTl5mBp
/url?q=https://www.adoclib.com/blog/how-to-extract-using-xidel-all-srcset-width-strings-from-an.html&sa=U&ved=2ahUKEwjQ7eCblIL4AhXCjqQKHVOcCNoQFnoECAgQAg&usg=AOvVaw3BfrZCAGHHs_nqpJ-1aj2u
[...]

xidel -s "https://www.google.com/search?q=xidel+follow+pagination" ^
      -f "//form" -e "//div[@class='egMi0 kCrYT']/a/resolve-uri(@href)"
https://www.google.com/url?q=https://stackoverflow.com/questions/37262813/xidel-how-to-follow-pagination-html-and-extract-url&sa=U&ved=2ahUKEwif0IL0mIL4AhUHtKQKHSh7DhoQFnoECAkQAg&usg=AOvVaw2o5RqheOFbiQv-KFW7Jhxd
https://www.google.com/url?q=https://stackoverflow.com/tags/xidel/hot%3Ffilter%3Dall&sa=U&ved=2ahUKEwif0IL0mIL4AhUHtKQKHSh7DhoQFnoECAgQAg&usg=AOvVaw19rnj9nPwMX-zKVSNzacrw
https://www.google.com/url?q=https://www.adoclib.com/blog/how-to-extract-using-xidel-all-srcset-width-strings-from-an.html&sa=U&ved=2ahUKEwif0IL0mIL4AhUHtKQKHSh7DhoQFnoECAcQAg&usg=AOvVaw3T4VVe92ucN0Jc7hzvAn8Y
[...]

xidel -s "https://www.google.com/search?q=xidel+follow+pagination" ^
      -f "//form" -e "//div[@class='egMi0 kCrYT']/a/request-decode(resolve-uri(@href))"
{
  "url": "https://www.google.com/url?q=https://stackoverflow.com/questions/37262813/xidel-how-to-follow-pagination-html-and-extract-url&sa=U&ved=2ahUKEwid9bHXmYL4AhWEIMUKHabxAoAQFnoECAAQAg&usg=AOvVaw1qftOzBqM1OfXkWkkJm0B8",
  "protocol": "https",
  "host": "www.google.com",
  "path": "url",
  "query": "q=https://stackoverflow.com/questions/37262813/xidel-how-to-follow-pagination-html-and-extract-url&sa=U&ved=2ahUKEwid9bHXmYL4AhWEIMUKHabxAoAQFnoECAAQAg&usg=AOvVaw1qftOzBqM1OfXkWkkJm0B8",
  "params": {
    "q": "https://stackoverflow.com/questions/37262813/xidel-how-to-follow-pagination-html-and-extract-url",
    "sa": "U",
    "ved": "2ahUKEwid9bHXmYL4AhWEIMUKHabxAoAQFnoECAAQAg",
    "usg": "AOvVaw1qftOzBqM1OfXkWkkJm0B8"
  }
}
[...]

xidel -s "https://www.google.com/search?q=xidel+follow+pagination" ^
      -f "//form" -e "//div[@class='egMi0 kCrYT']/a/request-decode(resolve-uri(@href))/params/q"
https://stackoverflow.com/questions/37262813/xidel-how-to-follow-pagination-html-and-extract-url
https://stackoverflow.com/tags/xidel/hot?filter=all
https://www.adoclib.com/blog/how-to-extract-using-xidel-all-srcset-width-strings-from-an.html
[...]
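
For readers who want to see what the resolve-uri / request-decode / params/q chain amounts to, here is a rough Python analogue using only the standard library. It is a sketch of the idea, not Xidel's implementation; the helper name `target_url` and the second sample href are made up for illustration, while the first href is copied from the output above.

```python
from urllib.parse import parse_qs, urljoin, urlsplit

# Page the hrefs were found on (taken from the commands above).
BASE = "https://www.google.com/search?q=xidel+follow+pagination"


def target_url(href, base=BASE):
    """Return the decoded 'q' parameter of a redirect-style href, or None.

    Hypothetical helper: mirrors the steps of the Xidel query, nothing more.
    """
    absolute = urljoin(base, href)      # roughly what resolve-uri(@href) does
    query = urlsplit(absolute).query    # everything after the '?'
    params = parse_qs(query)            # splits on '&' and percent-decodes
    values = params.get("q")
    return values[0] if values else None


hrefs = [
    # Copied from the first command's output above.
    "/url?q=https://stackoverflow.com/tags/xidel/hot%3Ffilter%3Dall"
    "&sa=U&ved=2ahUKEwjQ7eCblIL4AhXCjqQKHVOcCNoQFnoECAIQAg"
    "&usg=AOvVaw25MiKPwJB0jVHz2JTl5mBp",
    # Hypothetical href without a 'q' parameter.
    "/preferences?hl=en",
]

for href in hrefs:
    print(target_url(href))
```

The first href yields the decoded target url (note that `%3F` and `%3D` become `?` and `=`), and the second yields `None` because it carries no `q` parameter.
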
Follow pagination:

The final command above extracts the urls from the 1st results page. To include the urls from the other results pages, you can do a "recursive follow":

xidel -s "https://www.google.com/search?q=xidel+follow+pagination" ^
      -f "//form" -e "//div[@class='egMi0 kCrYT']/a/request-decode(resolve-uri(@href))/params/q" ^
      -f "//a[@aria-label and contains(.,'>')]"

-f "//a[@aria-label and contains(.,'>')]" "clicks" the next-page button until there are no more pages.
Note the warning from Xidel's author though: !!! Recursive follow is deprecated and might be removed soon. !!!
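
The "keep clicking next until there is no next" idea can also be written as an explicit loop, which may be useful if recursive follow ever disappears. Below is a minimal Python sketch that runs entirely on made-up in-memory pages: the `PAGES` dictionary, the `class="r"` result links and the `id="next"` marker are all assumptions for illustration and do not reflect Google's real markup.

```python
from html.parser import HTMLParser

# Hypothetical in-memory "site" (URL -> HTML) standing in for real result pages.
PAGES = {
    "/search?start=0": '<a class="r" href="https://example.com/a">a</a>'
                       '<a id="next" href="/search?start=10">&gt;</a>',
    "/search?start=10": '<a class="r" href="https://example.com/b">b</a>'
                        '<a id="next" href="/search?start=20">&gt;</a>',
    "/search?start=20": '<a class="r" href="https://example.com/c">c</a>',
}


class PageParser(HTMLParser):
    """Collect result links and the href of the next-page link, if any."""

    def __init__(self):
        super().__init__()
        self.results = []
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        if attr.get("id") == "next":        # assumed next-page marker
            self.next_href = attr.get("href")
        elif attr.get("class") == "r":      # assumed result-link marker
            self.results.append(attr.get("href"))


def crawl(start, pages=PAGES, max_pages=50):
    """Follow the next-page link until it is missing or max_pages is hit."""
    found, url, seen = [], start, set()
    while url is not None and url not in seen and len(seen) < max_pages:
        seen.add(url)
        parser = PageParser()
        parser.feed(pages[url])             # a real crawler would fetch here
        found.extend(parser.results)
        url = parser.next_href              # None on the last page ends the loop
    return found


print(crawl("/search?start=0"))
```

The `seen` set and the `max_pages` cap are there so that a page linking back to itself cannot make the loop run forever.
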

Search term through form()

A better alternative would be to visit the homepage and submit the search term through form(). A user-agent is needed, but the cookie-consent-button is automatically "clicked" and the HTML-source is easier to parse.

Extract the urls:
xidel -s --user-agent "Mozilla/5.0 Firefox/100.0" "https://www.google.com" ^
      -f "form(//form,{'q':'xidel follow pagination'})" -e "//div[@class='yuRUbf']/a/@href"
https://stackoverflow.com/questions/37262813/xidel-how-to-follow-pagination-html-and-extract-url
https://stackoverflow.com/tags/xidel/hot?filter=all
https://www.adoclib.com/blog/how-to-extract-using-xidel-all-srcset-width-strings-from-an.html
[...]
Follow pagination:

This can be done by yet another "recursive follow":

xidel -s --user-agent "Mozilla/5.0 Firefox/100.0" "https://www.google.com" ^
      -f "form(//form,{'q':'xidel follow pagination'})" -e "//div[@class='yuRUbf']/a/@href" ^
      -f "//a[@id='pnnext']/@href"

Changing the form()-parameters however is a lot easier in this case:

xidel -s --user-agent "Mozilla/5.0 Firefox/100.0" "https://www.google.com" ^
      -f "form(//form,{'q':'xidel follow pagination','num':'100'})" -e "//div[@class='yuRUbf']/a/@href"

I don't know if num has a hard limit or not, but 100 seems to work at least.
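
To make the form() parameters a bit more concrete, this small Python sketch shows the query string that such a key/value map would produce if the form is submitted with GET to /search. That submission target is an assumption based on the search url used earlier in this answer, not something read from the form itself.

```python
from urllib.parse import urlencode

# Same key/value pairs as in the form() call above.
params = {"q": "xidel follow pagination", "num": "100"}

# urlencode joins the pairs with '&' and encodes spaces as '+'.
query = urlencode(params)

# Assumed endpoint: the /search path seen in the earlier commands.
url = "https://www.google.com/search?" + query
print(url)
```
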

Upvotes: 2

MatrixView

Reputation: 321

Reino is right. But querying Google can also be done like this:

xidel -s "https://www.google.com" ^
      -f "form(//form,{'q':'xidel follow pagination','num':'25'})" ^
      -e "//a/extract(@href,'url\?q=(.+?)&',1)[.]"
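
As a rough illustration of what that regular expression captures, here is a Python sketch applying the same pattern to sample hrefs. The function name `extract_targets` and the non-matching href are hypothetical; the matching href is copied from Reino's output. One thing the sketch makes visible is that a plain regex capture is not percent-decoded, so a separate decoding step is shown as well.

```python
import re
from urllib.parse import unquote

# Same pattern as in the command: lazily capture everything between
# 'url?q=' and the next '&'.
PATTERN = re.compile(r"url\?q=(.+?)&")


def extract_targets(hrefs):
    """Return the captured group for every href that matches the pattern."""
    targets = []
    for href in hrefs:
        match = PATTERN.search(href)
        if match:                       # skip hrefs that do not match
            targets.append(match.group(1))
    return targets


hrefs = [
    # Copied from the output in the other answer.
    "/url?q=https://stackoverflow.com/tags/xidel/hot%3Ffilter%3Dall"
    "&sa=U&ved=2ahUKEwjQ7eCblIL4AhXCjqQKHVOcCNoQFnoECAIQAg"
    "&usg=AOvVaw25MiKPwJB0jVHz2JTl5mBp",
    # Hypothetical href that does not match.
    "/preferences?hl=en",
]

raw = extract_targets(hrefs)
print(raw)                              # still percent-encoded
print([unquote(t) for t in raw])        # decoded
```
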

Upvotes: 3
