GMan

Reputation: 81

Using urlopen I can get the html of the page, but a crucial part is missing

I am trying to write a script that finds similar images on Google from an image URL, using part of this code.

The problem is that I want to reach this link, because from it I can get to the images themselves by clicking the "search by image" link. When I use the script, I get the exact same page, but without the "search by image" link.

I would like to know why this happens and whether there is a way to fix it.

Thanks a lot in advance!

P.S. Here's the code

import os
from urllib2 import Request, urlopen
from cookielib import LWPCookieJar

USER_AGENT = r"Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)"
LOCAL_PATH = r"C:\scripts\google_search"
COOKIE_JAR_FILE = r".google-cookie"

class google_search(object):
    def cleanup(self):
        if os.path.isfile(self.cookie_jar_path):
            os.remove(self.cookie_jar_path)

        os.chdir(LOCAL_PATH)
        for html in os.listdir("."):
            if html.endswith(".html"):
                os.remove(html)

    def __init__(self, cookie_jar_path):
        self.cookie_jar_path = cookie_jar_path
        self.cookie_jar = LWPCookieJar(self.cookie_jar_path)
        self.counter = 0
        self.cleanup()
        try:
            # Load previously saved cookies; the file may not exist yet.
            self.cookie_jar.load()
        except Exception:
            pass


    def get_html(self, url):
        request = Request(url = url)

        request.add_header("User-Agent", USER_AGENT)
        self.cookie_jar.add_cookie_header(request)
        response = urlopen(request)
        self.cookie_jar.extract_cookies(response, request)
        html_response = response.read()
        response.close()
        self.cookie_jar.save()
        return html_response


def main():
    url_2 = r"http://www.google.com/search?hl=en&q=http%3A%2F%2Fi.imgur.com%2FqGRxTNA.jpg&btnG=Google+Search"
    search = google_search(os.path.join(LOCAL_PATH, COOKIE_JAR_FILE))
    html_2 = search.get_html(url_2)


if __name__ == '__main__':
    main()

Upvotes: 1

Views: 87

Answers (1)

UltraInstinct

Reputation: 44444

I tried something similar a few weeks ago. My requests were rejected with a 404 because I was not setting a proper User-Agent.

In your case, I think the User-Agent you are sending is the problem. This is the User-Agent header I used:

USER_AGENT = r"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36"
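
A minimal sketch of how that header could be attached to a request before it is opened. The `build_request` helper is a hypothetical name, not part of your script, and the import fallback is only there so the same snippet runs on Python 2 (`urllib2`, as in your code) or Python 3 (`urllib.request`). It builds the request without sending it, so you can check the header first.

```python
try:
    from urllib2 import Request  # Python 2, as in the question
except ImportError:
    from urllib.request import Request  # Python 3

# Browser-like User-Agent string taken from this answer.
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_5) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/30.0.1599.101 Safari/537.36"
)


def build_request(url, user_agent=USER_AGENT):
    """Return a Request for url carrying the given User-Agent header.

    Hypothetical helper for illustration; nothing is sent over the network.
    """
    request = Request(url)
    request.add_header("User-Agent", user_agent)
    return request


if __name__ == "__main__":
    req = build_request("http://www.google.com/search?hl=en&q=test")
    # Request stores header names capitalized, hence "User-agent" here.
    print(req.get_header("User-agent"))
```

In your `get_html` method this corresponds to the existing `request.add_header("User-Agent", USER_AGENT)` call, so changing the `USER_AGENT` constant should be enough.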

PS: I hope you have read Google's terms and conditions. You might be violating them.

Upvotes: 1
