user4302594

Reputation:

Unable to get Google search results in Python

I'm building a script to scrape Google search results. This is what I have so far:

import urllib
keyword = "google"
print urllib.urlopen("https://www.google.co.in/search?q=" + keyword).read()

But it gives me a reply as follows:

Error 403 (Forbidden)

403. That's an error.

Your client does not have permission to get URL /search?q=google from this server. (Client IP address: 117.196.168.89)

Please see Google's Terms of Service posted at http://www.google.com/terms_of_service.html

If you believe that you have received this response in error, please report your problem at http://www.google.com/support/bin/request.py?contact_type=user&hl=en and include the diagnostic code from the response along with details of how you are performing your Google searches.

Doesn't Google allow its pages to be scraped?

Upvotes: 1

Views: 2832

Answers (3)

Dmitriy Zub

Reputation: 1724

Google sees that your script is not sending a browser user-agent (with the requests library, the default is python-requests), so it rejects the request.

All you need to do is specify a browser user-agent (Chrome, Mozilla, Edge, IE, Safari, ...) so Google treats the request as coming from a "user", i.e. you fake a real browser visit.

If you're using the requests library, you can specify it this way (lists of common user-agent strings are easy to find online):

import requests

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

# Pass the query via params so requests URL-encodes the spaces for you.
response = requests.get('https://www.google.com/search',
                        headers=headers,
                        params={'q': 'pizza is awesome'}).text

I answered the question of how to scrape Google Search result titles, summaries, and links with example code here.
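For completeness, here is a minimal sketch of pulling titles and links out of that response with requests and BeautifulSoup. The selectors ("div.g" containers with an "h3" title and an "a" link) are assumptions about Google's current result markup and break whenever Google changes its HTML:

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36"
}

html = requests.get("https://www.google.com/search",
                    headers=headers,
                    params={"q": "pizza is awesome"}).text

soup = BeautifulSoup(html, "html.parser")

# "div.g" / "h3" / "a" are assumptions about Google's result markup.
for result in soup.select("div.g"):
    title = result.select_one("h3")
    link = result.select_one("a")
    if title and link:
        print(title.get_text(), link.get("href"))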


Alternatively, you can use a third-party Google Search Engine Results API such as SerpApi's Google Organic Results API. It's a paid API with a free trial.

Check out the Playground to test and see the output.

Code to get raw HTML response:

import os
import urllib.request
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "london",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

# 'raw_html_file' is a link to the raw HTML that Google returned.
html = results['search_metadata']['raw_html_file']
print(urllib.request.urlopen(html).read())
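If you want parsed results rather than raw HTML, the same response dictionary exposes structured fields; a minimal sketch, assuming the "organic_results" / "title" / "link" field names from SerpApi's documentation:

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "london",
    "api_key": os.getenv("API_KEY"),
}

results = GoogleSearch(params).get_dict()

# "organic_results" with "title"/"link" keys is assumed from SerpApi's docs.
for result in results.get("organic_results", []):
    print(result["title"], result["link"])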

Disclaimer: I work for SerpApi.

Upvotes: 0

sk11

Reputation: 1824

You can also fake the User-Agent header with urllib2 to get the results.

Something like:

import urllib2

keyword = "google"
url = "https://www.google.co.in/search?q=" + keyword

# Build an opener
opener = urllib2.build_opener()

# If you are behind a proxy, build the opener with a ProxyHandler instead:
#opener = urllib2.build_opener(urllib2.ProxyHandler(proxies={"http": "http://proxy.corp.ads:8080"}))

# To fake the browser
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
print opener.open(url).read()
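If you are on Python 3, urllib2 was merged into urllib.request; a rough equivalent of the same trick (a sketch, not tested against current Google responses) would be:

import urllib.request
import urllib.parse

keyword = "google"
url = "https://www.google.co.in/search?q=" + urllib.parse.quote_plus(keyword)

# Build an opener and fake the browser user-agent, same idea as above.
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
print(opener.open(url).read())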

Upvotes: 1

T90

Reputation: 577

Actually, Google doesn't, in the sense that it blocks bots. But you can use mechanize to fake a browser and get the results.


    import mechanize

    keyword = "google"  # the search term from the question

    chrome = mechanize.Browser()
    chrome.set_handle_robots(False)
    chrome.addheaders = [('User-agent',
        'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36')]
    base_url = 'https://www.google.co.in/search?q='
    search_url = base_url + keyword.replace(' ', '+')
    htmltext = chrome.open(search_url).read()

Try this; I hope it helps.
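If you also want to pull the result links out of that page, mechanize can iterate over the anchors it parsed; a minimal sketch, where the "/url?q=" filter is an assumption about how Google wraps organic result links:

    import mechanize

    keyword = "google"
    chrome = mechanize.Browser()
    chrome.set_handle_robots(False)
    chrome.addheaders = [('User-agent',
        'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36')]
    chrome.open('https://www.google.co.in/search?q=' + keyword.replace(' ', '+'))

    # chrome.links() yields the anchors mechanize parsed from the result page;
    # filtering on "/url?q=" is a heuristic for organic results and may break.
    for link in chrome.links():
        if link.url.startswith('/url?q='):
            print(link.url)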

Upvotes: 2
