Reputation: 47
I am trying to parse a website that I find through Google search results. The URL returned by the search is correct, but when I request the page, the website responds with a CAPTCHA that has to be solved. Is it possible to display the CAPTCHA to the user, let them solve it, and then continue parsing the page?
The HTML returned by the first request contains this script with the CAPTCHA details:
<script>
    var recaptchaReady = function () {
        grecaptcha.render('iCaptcha', {'sitekey': 'XXXXXXXXXXXXXXXXX-XXXXXXXXX'});
        captchaRendered = true;
    };
    window.onload = function () {
        if (!window['captchaRendered']) {
            var captchaElement = document.getElementById('iCaptcha'),
                captchaFallback = document.getElementById('captcha-noscript'),
                captchaFallbackText = document.getElementById('captcha-fallback-text');
            captchaElement.innerHTML = captchaFallback.textContent;
            captchaFallbackText.textContent = captchaFallback.getAttribute('data-text');
        }
    };
</script>
This is the Python code I am using to fetch the URL I am looking for:
import requests
from googlesearch import search
from bs4 import BeautifulSoup

query = "example search"  # placeholder; the original post does not show the query
# Take only the first Google result
search_result_list = list(search(query, tld="de", num=1, stop=1, pause=1))
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT":"1","Connection":"close", "Upgrade-Insecure-Requests":"1"}
page = requests.get(search_result_list[0], headers=headers)
# Save the response so the returned HTML can be inspected
with open("output1.html", "w", encoding="utf-8") as file:
    file.write(page.text)
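One way to tell whether the saved response is the CAPTCHA page rather than the real content is to look for the `grecaptcha.render` call shown in the script above. This is only a minimal sketch: it assumes the site key always appears as a quoted `'sitekey'` entry, and the function names `find_recaptcha_sitekey` and `is_captcha_page` are hypothetical helpers, not part of any library.

```python
import re

# Assumption: the site key is passed as a quoted 'sitekey' entry,
# as in the grecaptcha.render(...) call from the page's script.
SITEKEY_RE = re.compile(r"""['"]sitekey['"]\s*:\s*['"]([^'"]+)['"]""")


def find_recaptcha_sitekey(html):
    """Return the reCAPTCHA site key found in html, or None if there is none."""
    match = SITEKEY_RE.search(html)
    return match.group(1) if match else None


def is_captcha_page(html):
    """Heuristic check: does this HTML look like the CAPTCHA page?"""
    return "grecaptcha.render" in html or find_recaptcha_sitekey(html) is not None
```

With the code above, `is_captcha_page(page.text)` could be checked before any parsing, so the CAPTCHA case is handled separately from the normal case.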
Upvotes: 0
Views: 962
Reputation: 337
Although I am not using the google 3.0.0 module from PyPI, you can take an approach based on Google Cache, together with a referer header, which can help avoid the CAPTCHA.
For example, to search for "ukraine war":
import requests
from bs4 import BeautifulSoup

header = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
    "referer": "https://www.google.com/",
}
query = "ukraine war"
html = requests.get(f"http://webcache.googleusercontent.com/search?q=cache:{query}", headers=header)
soup = BeautifulSoup(html.text, 'lxml')
soup.find_all(['h3', 'cite'])
Output:
[<h3 class="LC20lb MBeuO DKV0Md">Ukraine: Russia 'destroys cache of Western arms' | Metro News</h3>,
<cite class="iUh30 qLRx3b tjvcx" role="text">https://metro.co.uk<span class="dyjrff qzEoUe" role="text"> › News › World</span></cite>,
<cite class="iUh30 qLRx3b tjvcx" role="text">https://metro.co.uk<span class="dyjrff qzEoUe" role="text"> › News › World</span></cite>,
<h3 class="LC20lb MBeuO DKV0Md">Russian military police uncover Ukrainian arms cache ... - TASS</h3>,
<cite class="iUh30 qLRx3b tjvcx" role="text">https://tass.com<span class="dyjrff qzEoUe" role="text"> › politics</span></cite>,
<cite class="iUh30 qLRx3b tjvcx" role="text">https://tass.com<span class="dyjrff qzEoUe" role="text"> › politics</span></cite>,
<h3 class="LC20lb MBeuO DKV0Md">Zelenskiy says Donbas is 'completely destroyed' - The Guardian</h3>,
<cite class="iUh30 qLRx3b tjvcx" role="text">https://www.theguardian.com<span class="dyjrff qzEoUe" role="text"> › live › may</span></cite>,
<cite class="iUh30 qLRx3b tjvcx" role="text">https://www.theguardian.com<span class="dyjrff qzEoUe" role="text"> › live › may</span></cite>,
<h3 class="LC20lb MBeuO DKV0Md">Protesters in Ukraine guard biggest weapons cache in eastern ...</h3>,
<cite class="iUh30 qLRx3b tjvcx" role="text">https://www.theguardian.com<span class="dyjrff qzEoUe" role="text"> › world › apr</span></cite>,
<cite class="iUh30 qLRx3b tjvcx" role="text">https://www.theguardian.com<span class="dyjrff qzEoUe" role="text"> › world › apr</span></cite>,
<h3 aria-level="2" class="GmE3X" role="heading">「cache:ukraine war」的圖片搜尋結果</h3>,
<h3 class="LC20lb MBeuO DKV0Md">Inside the trenches: CNN joins Ukraine's army on the front lines</h3>,
<cite class="iUh30 qLRx3b tjvcx" role="text">https://www.cnn.com<span class="dyjrff qzEoUe" role="text"> › videos › world › 2022/05/26 › uk...</span></cite>,
<cite class="iUh30 qLRx3b tjvcx" role="text">https://www.cnn.com<span class="dyjrff qzEoUe" role="text"> › videos › world › 2022/05/26 › uk...</span></cite>,
<h3 class="LC20lb MBeuO DKV0Md">What Happened on Day 79 of the War in Ukraine - The New ...</h3>,
<cite class="iUh30 qLRx3b tjvcx" role="text">https://www.nytimes.com<span class="dyjrff qzEoUe" role="text"> › live › 2022/05/13 › world › r...</span></cite>,
<cite class="iUh30 qLRx3b tjvcx" role="text">https://www.nytimes.com<span class="dyjrff qzEoUe" role="text"> › live › 2022/05/13 › world › r...</span></cite>,
<h3 class="LC20lb MBeuO DKV0Md">Ukraine Becomes the World's “First TikTok War” - The New ...</h3>,
<cite class="iUh30 qLRx3b tjvcx" role="text">https://www.newyorker.com<span class="dyjrff qzEoUe" role="text"> › culture</span></cite>,
<cite class="iUh30 qLRx3b tjvcx" role="text">https://www.newyorker.com<span class="dyjrff qzEoUe" role="text"> › culture</span></cite>,
<h3 class="LC20lb MBeuO DKV0Md">NATO's weapons supply to Ukraine may divert cache to illegal ...</h3>,
<cite class="iUh30 qLRx3b tjvcx" role="text">https://www.republicworld.com<span class="dyjrff qzEoUe" role="text"> › natos-w...</span></cite>,
<cite class="iUh30 qLRx3b tjvcx" role="text">https://www.republicworld.com<span class="dyjrff qzEoUe" role="text"> › natos-w...</span></cite>,
<h3 class="LC20lb MBeuO DKV0Md">Ukraine Braces For Escalated Russian Attacks Ahead Of ...</h3>,
<cite class="iUh30 qLRx3b tjvcx" role="text">https://www.rferl.org<span class="dyjrff qzEoUe" role="text"> › ukraine-donbas-o...</span></cite>,
<cite class="iUh30 qLRx3b tjvcx" role="text">https://www.rferl.org<span class="dyjrff qzEoUe" role="text"> › ukraine-donbas-o...</span></cite>,
<h3 class="O3JH7"><span class="q8U8x">相關搜尋</span></h3>]
There are some things to note in @Joshua's answer before you use the cache approach.
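The `find_all` output above lists every `<cite>` twice and also includes headings that have no link. Below is a minimal sketch of pairing each title with its site. It uses the standard library's `html.parser` instead of BeautifulSoup (a swapped-in technique, so it runs without extra packages), and the names `HeadingCiteParser` and `pair_titles_with_sites` are hypothetical. It assumes `<h3>` and `<cite>` elements are not nested inside each other.

```python
from html.parser import HTMLParser


class HeadingCiteParser(HTMLParser):
    """Collect the text of every <h3> and <cite> element, in document order."""

    def __init__(self):
        super().__init__()
        self.items = []     # (tag, text) tuples
        self._open = None   # tag currently being captured, if any
        self._parts = []    # text fragments of the open element

    def handle_starttag(self, tag, attrs):
        if self._open is None and tag in ("h3", "cite"):
            self._open = tag
            self._parts = []

    def handle_data(self, data):
        if self._open is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == self._open:
            self.items.append((tag, "".join(self._parts).strip()))
            self._open = None


def pair_titles_with_sites(html):
    """Return (title, site) pairs, using the first <cite> after each <h3>."""
    parser = HeadingCiteParser()
    parser.feed(html)
    pairs = []
    title = None
    for tag, text in parser.items:
        if tag == "h3":
            title = text  # a heading with no <cite> is replaced by the next one
        elif title is not None:
            pairs.append((title, text))
            title = None  # skip the duplicate <cite> that follows
    return pairs
```

Calling `pair_titles_with_sites(html.text)` on the response from the request above would give one pair per result, with headings such as the image-results title dropped because no `<cite>` follows them.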
Upvotes: 1