his1devil
his1devil

Reputation: 1

Python About check the Google Search URL's contents loop issues

# -*- coding: utf-8 -*-

import re
import csv
import urllib
import urllib2
import BeautifulSoup
Filter = [' ab1',' ab2',' dc4',....]
urllists = ['myurl1','myurl2','myurl3',...]
csvfile = file('csv_test.csv','wb')
writer = csv.writer(csvfile)
writer.writerow(['keyword','url'])
for eachUrl in urllists:
    for kword in Filter:
        keyword = "site:" + urllib.quote_plus(eachUrl) + kword
        safeKeyword = urllib.quote_plus(keyword)
        fullQuery = 'http://www.google.com/search?sourceid=chrome&client=ubuntu&channel=cs&    ie=UTF-8&q=' + safeKeyword

        req = urllib2.Request(fullQuery, headers = {'User-Agent': 'Mozilla/15.0 (X11; Linux x86_64) AppleWebKit/535.11 (KHTML, like Gecko) Ubuntu/12.04 Chrome/21.0.118083 Safari/535.11'})
        html = urllib2.urlopen(req).read()

        soup = BeautifulSoup.BeautifulSoup(html, fromEncoding = 'utf8')

        resultURLList = [t.a['href'] for t in soup.findAll('h3', {'class':'r'})]

        if resultURLList:
            for l in resultURLList:
                needCheckHtml = urllib2.urlopen(l).read()
                if needCheckHtml:
                    x = re.compile(r"\b" + kword + r"\b")
                    p = x.search(needCheckHtml)
                    if p:
                        data = [kword, l]
                        writer.writerow(data)

        else:
            print '%s: No Results' % kword
csvfile.close()

A simple script about checking the url shows on google searchresults, and open it, check and match the keyword in list Filter use re, the above code, may cause some Error, for example, HTTPERROR, URLError, but i dont know how to fix and impove the code, can someone help me with that? Please.. if face some google reject, wanna use os.system("rasdial name user code") to reconnect the PPPOE and change the IP, so how fix this code Thanks very much !!

Upvotes: 0

Views: 632

Answers (1)

bpgergo
bpgergo

Reputation: 16037

I'm not sure how much this helps, but there is a search API that you can use without Google blocking your request and without the need to change your IP address; although there are some restrictions here as well.

http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=AnT4i

{"responseData": {"results":[{"GsearchResultClass":"GwebSearch","unescapedUrl":"http://www.ncbi.nlm.nih.gov/pubmed/11526138","url":"http://www.ncbi.nlm.nih.gov/pubmed/11526138","visibleUrl":"www.ncbi.nlm.nih.gov","cacheUrl":"","title":"Identification of aminoglycoside-modifying enzymes by susceptibility \u003cb\u003e...\u003c/b\u003e","titleNoFormatting":"Identification of aminoglycoside-modifying enzymes by susceptibility ...","content":"In 381 Japanese MRSA isolates, the \u003cb\u003eant(4\u0026#39;)-I\u003c/b\u003e, aac(6\u0026#39;)-aph(2\u0026quot;), and aph(3\u0026#39;)-III   genes \u003cb\u003e...\u003c/b\u003e Isolates with only the \u003cb\u003eant(4\u0026#39;)-I\u003c/b\u003e gene had coagulase type II or III, but   isolates \u003cb\u003e...\u003c/b\u003e"},{"GsearchResultClass":"GwebSearch","unescapedUrl":"http://www.ncbi.nlm.nih.gov/pubmed/1047990","url":"http://www.ncbi.nlm.nih.gov/pubmed/1047990","visibleUrl":"www.ncbi.nlm.nih.gov","cacheUrl":"","title":"[\u003cb\u003eANT(4\u0026#39;)I\u003c/b\u003e: a new aminoglycoside nucleotidyltransferase found in \u003cb\u003e...\u003c/b\u003e","titleNoFormatting":"[ANT(4\u0026#39;)I: a new aminoglycoside nucleotidyltransferase found in ...","content":"[\u003cb\u003eANT(4\u0026#39;)I\u003c/b\u003e: a new aminoglycoside nucleotidyltransferase found in \u0026quot;staphylococcus   aureus\u0026quot; (author\u0026#39;s transl)]. [Article in French]. Le Goffic F, Baca B, Soussy CJ, \u003cb\u003e...\u003c/b\u003e"},{"GsearchResultClass":"GwebSearch","unescapedUrl":"http://jcm.asm.org/content/27/11/2535","url":"http://jcm.asm.org/content/27/11/2535","visibleUrl":"jcm.asm.org","cacheUrl":"","title":"Use of plasmid analysis and determination of aminoglycoside \u003cb\u003e...\u003c/b\u003e","titleNoFormatting":"Use of plasmid analysis and determination of aminoglycoside ...","content":"Aminoglycoside resistance pattern determinations revealed the presence of the   \u003cb\u003eANT(4\u0026#39;)-I\u003c/b\u003e enzyme (aminoglycoside 4\u0026#39; adenyltransferase) in all group 1 isolates \u003cb\u003e...\u003c/b\u003e"},{"GsearchResultClass":"GwebSearch","unescapedUrl":"http://ukpmc.ac.uk/articles/PMC88306","url":"http://ukpmc.ac.uk/articles/PMC88306","visibleUrl":"ukpmc.ac.uk","cacheUrl":"","title":"Identification of Aminoglycoside-Modifying Enzymes by \u003cb\u003e...\u003c/b\u003e","titleNoFormatting":"Identification of Aminoglycoside-Modifying Enzymes by ...","content":"The technique used three sets of primers delineating specific DNA fragments of   the aph(3\u0026#39;)-III, \u003cb\u003eant(4\u0026#39;)-I\u003c/b\u003e, and aac(6\u0026#39;)-aph(2\u0026quot;) genes, which influence the MICs of \u003cb\u003e...\u003c/b\u003e"}],"cursor":{"resultCount":"342","pages":[{"start":"0","label":1},{"start":"4","label":2},{"start":"8","label":3},{"start":"12","label":4},{"start":"16","label":5},{"start":"20","label":6},{"start":"24","label":7},{"start":"28","label":8}],"estimatedResultCount":"342","currentPageIndex":0,"moreResultsUrl":"http://www.google.com/search?oe\u003dutf8\u0026ie\u003dutf8\u0026source\u003duds\u0026start\u003d0\u0026hl\u003den\u0026q\u003dAnT4i","searchResultTime":"0.25"}}, "responseDetails": null, "responseStatus": 200}

see http://googlesystem.blogspot.hu/2008/04/google-search-rest-api.html

Upvotes: 1

Related Questions