Reputation: 69
On Windows 7, using a batch file and Xidel, I am testing against a website with pagination like this example:
link1
link2
link3
1 2 3 4 5 6 7 8 9 10 Next
I found a way to get the first 10 links:
xidel.exe https://www.website.es/search?q=xidel+follow+pagination^&start=0 --extract "//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']"
But when I try to follow to page 2 or page n with
-f "<A class="fl">{.}</A>"
or
--follow "//a/[@class='nav']"
nothing works. Can you give me some help or an example?
Thanks.
Upvotes: 1
Views: 1027
Reputation: 3433
With this simple query...
xidel "https://www.google.com/search?q=xidel+follow+pagination" -e "$url"
https://consent.google.com/ml?continue=[...]
...you'll notice we're hitting a cookie wall. With -f "//form" Xidel can "click" on the consent button.
xidel -s "https://www.google.com/search?q=xidel+follow+pagination" ^
-f "//form" -e "//div[@class='egMi0 kCrYT']/a/@href"
/url?q=https://stackoverflow.com/questions/37262813/xidel-how-to-follow-pagination-html-and-extract-url&sa=U&ved=2ahUKEwjQ7eCblIL4AhXCjqQKHVOcCNoQFnoECAYQAg&usg=AOvVaw2Yyh9OVSR_FLKehWApnFK2
/url?q=https://stackoverflow.com/tags/xidel/hot%3Ffilter%3Dall&sa=U&ved=2ahUKEwjQ7eCblIL4AhXCjqQKHVOcCNoQFnoECAIQAg&usg=AOvVaw25MiKPwJB0jVHz2JTl5mBp
/url?q=https://www.adoclib.com/blog/how-to-extract-using-xidel-all-srcset-width-strings-from-an.html&sa=U&ved=2ahUKEwjQ7eCblIL4AhXCjqQKHVOcCNoQFnoECAgQAg&usg=AOvVaw3BfrZCAGHHs_nqpJ-1aj2u
[...]
xidel -s "https://www.google.com/search?q=xidel+follow+pagination" ^
-f "//form" -e "//div[@class='egMi0 kCrYT']/a/resolve-uri(@href)"
https://www.google.com/url?q=https://stackoverflow.com/questions/37262813/xidel-how-to-follow-pagination-html-and-extract-url&sa=U&ved=2ahUKEwif0IL0mIL4AhUHtKQKHSh7DhoQFnoECAkQAg&usg=AOvVaw2o5RqheOFbiQv-KFW7Jhxd
https://www.google.com/url?q=https://stackoverflow.com/tags/xidel/hot%3Ffilter%3Dall&sa=U&ved=2ahUKEwif0IL0mIL4AhUHtKQKHSh7DhoQFnoECAgQAg&usg=AOvVaw19rnj9nPwMX-zKVSNzacrw
https://www.google.com/url?q=https://www.adoclib.com/blog/how-to-extract-using-xidel-all-srcset-width-strings-from-an.html&sa=U&ved=2ahUKEwif0IL0mIL4AhUHtKQKHSh7DhoQFnoECAcQAg&usg=AOvVaw3T4VVe92ucN0Jc7hzvAn8Y
[...]
xidel -s "https://www.google.com/search?q=xidel+follow+pagination" ^
-f "//form" -e "//div[@class='egMi0 kCrYT']/a/request-decode(resolve-uri(@href))"
{
"url": "https://www.google.com/url?q=https://stackoverflow.com/questions/37262813/xidel-how-to-follow-pagination-html-and-extract-url&sa=U&ved=2ahUKEwid9bHXmYL4AhWEIMUKHabxAoAQFnoECAAQAg&usg=AOvVaw1qftOzBqM1OfXkWkkJm0B8",
"protocol": "https",
"host": "www.google.com",
"path": "url",
"query": "q=https://stackoverflow.com/questions/37262813/xidel-how-to-follow-pagination-html-and-extract-url&sa=U&ved=2ahUKEwid9bHXmYL4AhWEIMUKHabxAoAQFnoECAAQAg&usg=AOvVaw1qftOzBqM1OfXkWkkJm0B8",
"params": {
"q": "https://stackoverflow.com/questions/37262813/xidel-how-to-follow-pagination-html-and-extract-url",
"sa": "U",
"ved": "2ahUKEwid9bHXmYL4AhWEIMUKHabxAoAQFnoECAAQAg",
"usg": "AOvVaw1qftOzBqM1OfXkWkkJm0B8"
}
}
[...]
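The decoded structure above shows that the address of interest sits in the q parameter. As a rough cross-check outside Xidel, the same resolve-and-decode step can be sketched in Python with the standard library. This is only an illustration under assumptions: target_url is a made-up helper name, the sample href is shortened from the output above, and urljoin/parse_qs are loose analogues of resolve-uri() and request-decode(), not the same functions.

```python
from urllib.parse import urljoin, urlsplit, parse_qs

# Page the links were scraped from (taken from the commands above).
BASE = "https://www.google.com/search?q=xidel+follow+pagination"

def target_url(href, base=BASE):
    """Resolve href against base and return its decoded 'q' parameter, or None."""
    absolute = urljoin(base, href)                # loose analogue of resolve-uri()
    params = parse_qs(urlsplit(absolute).query)   # loose analogue of request-decode()/params
    values = params.get("q")
    return values[0] if values else None

# Shortened sample href; the real ones carry longer 'ved' and 'usg' values.
sample = "/url?q=https://stackoverflow.com/tags/xidel/hot%3Ffilter%3Dall&sa=U"
print(target_url(sample))
```

Note that parse_qs also percent-decodes the value, which is why %3F and %3D come back as ? and =.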
xidel -s "https://www.google.com/search?q=xidel+follow+pagination" ^
-f "//form" -e "//div[@class='egMi0 kCrYT']/a/request-decode(resolve-uri(@href))/params/q"
https://stackoverflow.com/questions/37262813/xidel-how-to-follow-pagination-html-and-extract-url
https://stackoverflow.com/tags/xidel/hot?filter=all
https://www.adoclib.com/blog/how-to-extract-using-xidel-all-srcset-width-strings-from-an.html
[...]
The final command above extracts the URLs from the first results page. To include the URLs from the other results pages you can do a "recursive follow":
xidel -s "https://www.google.com/search?q=xidel+follow+pagination" ^
-f "//form" -e "//div[@class='egMi0 kCrYT']/a/request-decode(resolve-uri(@href))/params/q" ^
-f "//a[@aria-label and contains(.,'>')]"
-f "//a[@aria-label and contains(.,'>')]" "clicks" the next-page button until there are no more pages.
Note the warning by Xidel's author though: "!!! Recursive follow is deprecated and might be removed soon. !!!"
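Conceptually, a "recursive follow" is a loop: process the current page, look for a next-page link, and repeat until none is found. The Python model below is only a sketch of that idea; the PAGES dictionary, its keys, and collect_links are all invented for illustration, and nothing here touches the network or Xidel itself.

```python
# Hypothetical in-memory "site": each page lists its result links and,
# optionally, the name of the next page.
PAGES = {
    "page1": {"links": ["a", "b"], "next": "page2"},
    "page2": {"links": ["c"], "next": "page3"},
    "page3": {"links": ["d"]},  # no "next" entry: this is the last page
}

def collect_links(pages, start):
    """Gather links page by page, following "next" until it is missing."""
    collected = []
    visited = set()
    current = start
    while current is not None and current not in visited:
        visited.add(current)              # guard against pagination loops
        page = pages[current]
        collected.extend(page["links"])
        current = page.get("next")        # None once there is no next page
    return collected

print(collect_links(PAGES, "page1"))
```

The visited set is a design choice of this sketch: it stops the loop if a site's "next" link ever points back to a page that was already processed.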
form()
A better alternative would be to visit the homepage and submit the search term through form(). A user-agent is needed, but the cookie-consent button is automatically "clicked" and the HTML source is easier to parse.
xidel -s --user-agent "Mozilla/5.0 Firefox/100.0" "https://www.google.com" ^
-f "form(//form,{'q':'xidel follow pagination'})" -e "//div[@class='yuRUbf']/a/@href"
https://stackoverflow.com/questions/37262813/xidel-how-to-follow-pagination-html-and-extract-url
https://stackoverflow.com/tags/xidel/hot?filter=all
https://www.adoclib.com/blog/how-to-extract-using-xidel-all-srcset-width-strings-from-an.html
[...]
Including the other results pages can again be done with a "recursive follow":
xidel -s --user-agent "Mozilla/5.0 Firefox/100.0" "https://www.google.com" ^
-f "form(//form,{'q':'xidel follow pagination'})" -e "//div[@class='yuRUbf']/a/@href" ^
-f "//a[@id='pnnext']/@href"
Changing the form() parameters, however, is a lot easier in this case:
xidel -s --user-agent "Mozilla/5.0 Firefox/100.0" "https://www.google.com" ^
-f "form(//form,{'q':'xidel follow pagination','num':'100'})" -e "//div[@class='yuRUbf']/a/@href"
I don't know whether num has a hard limit, but 100 seems to work at least.
Upvotes: 2
Reputation: 321
Reino is right. But querying Google can also be done like this:
xidel -s "https://www.google.com" ^
-f "form(//form,{'q':'xidel follow pagination','num':'25'})" ^
-e "//a/extract(@href,'url\?q=(.+?)&',1)[.]"
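For readers who want to see what that regular expression captures, here is a small Python sketch of the same pattern applied to sample hrefs. It is an illustration only: extract_target is a made-up helper, the first href is shortened from the output in the other answer, and the second href is a hypothetical non-result link. Python's re module stands in for Xidel's extract(), so details of the regex dialect may differ.

```python
import re
from urllib.parse import unquote

# Same pattern as in the command above: lazily capture everything
# between "url?q=" and the next "&".
PATTERN = re.compile(r"url\?q=(.+?)&")

def extract_target(href):
    """Return the captured group, or an empty string if there is no match."""
    match = PATTERN.search(href)
    return match.group(1) if match else ""

hrefs = [
    "/url?q=https://stackoverflow.com/tags/xidel/hot%3Ffilter%3Dall&sa=U",
    "/preferences?hl=en",  # hypothetical link that is not a search result
]

# Dropping empty strings plays the role of the trailing [.] filter.
targets = [t for t in map(extract_target, hrefs) if t]
decoded = [unquote(t) for t in targets]

print(targets)
print(decoded)
```

A regex capture returns the text as it appears in the href, so percent-encoded characters such as %3F stay encoded unless they are decoded in a separate step, as the unquote call does here.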
Upvotes: 3