Lucas

Reputation: 154

Python Google Search

I'm trying to search Google for a word with Python, extract the result links into a list, and print that list. But I've run into this problem:

class search:
    def __init__(self, search):
        page = requests.get("http://www.google.de/search?q="+search)
        soup = BeautifulSoup(page.content)
        links = soup.findAll("a")
        for link in  soup.find_all("a",href=re.compile("(?<=/url\?q=)(htt.*://.*)")):
            print re.split(":(?=http)",link["href"].replace("/url?q=",""))

search("lol")

This works, but look at the output:

['http://euw.leagueoflegends.com/de&sa=U&ved=0ahUKEwie3sWOkbHRAhVGGCwKHSChAWQQFggVMAA&usg=AFQjCNEkd1xB6jaSnzWz-VpYcnHvSNYMJA']

['http://webcache.googleusercontent.com/search%3Fq%3Dcache:as12jwqcnbAJ', 'http://euw.leagueoflegends.com/de%252Blol%26hl%3Dde%26ct%3Dclnk&sa=U&ved=0ahUKEwie3sWOkbHRAhVGGCwKHSCqewsfdvfgh1A&usg=AFQjCNEm132qewdasDq2hCb9SRjnbmbMb3rkw']

(and so on)

How do I collect these results into a single list? And how can I remove the webcache entries?

I know that the URLs are percent-encoded, but I can simply decode them with urllib2.
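For reference, a minimal sketch of that decoding step. It assumes Python 3, where the function lives in urllib.parse (in Python 2 the equivalent is urllib.unquote). The sample string is the tail of the webcache entry shown above; note that it is encoded twice, so a single pass still leaves %2B behind.

```python
from urllib.parse import unquote

# Fragment copied from the webcache line of the output above.
encoded = "http://euw.leagueoflegends.com/de%252Blol%26hl%3Dde%26ct%3Dclnk"

once = unquote(encoded)   # %25 -> %, %26 -> &, %3D -> =
twice = unquote(once)     # the remaining %2B -> +

print(once)
print(twice)
```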

Thank you in advance!

Upvotes: 3

Views: 1719

Answers (1)

Eric Duminil

Reputation: 54303

This should get you closer. The links variable in your code was assigned but never used. The method below returns a single list and leaves out any URL containing webcache:

from bs4 import BeautifulSoup
import requests
import re

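class Google:
    @classmethod
    def search(cls, query):
        page = requests.get("http://www.google.de/search?q=" + query)
        # Name the parser explicitly so bs4 does not have to guess one.
        soup = BeautifulSoup(page.content, "html.parser")
        # Result links have the form /url?q=<target>&sa=...
        links = soup.find_all("a", href=re.compile(r"(?<=/url\?q=)(htt.*://.*)"))
        # Strip the prefix and keep only the part before any embedded second URL.
        urls = [re.split(":(?=http)", link["href"].replace("/url?q=", ""))[0] for link in links]
        # Drop Google's cached copies.
        return [url for url in urls if "webcache" not in url]

print(Google.search("lol"))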

It outputs:

[u'http://euw.leagueoflegends.com/de&sa=U&ved=0ahUKEwixjpPMmrHRAhUHlSwKHUIuCIIQFggVMAA&usg=AFQjCNEkd1xB6jaSnzWz-VpYcnHvSNYMJA', u'http://euw.leagueoflegends.com/de/news/&sa=U&ved=0ahUKEwixjpPMmrHRAhUHlSwKHUIuCIIQjBAIHDAB&usg=AFQjCNGY7DvS4oNNQktCTf3FGtStOG9xvA', u'http://gameinfo.euw.leagueoflegends.com/de/game-info/&sa=U&ved=0ahUKEwixjpPMmrHRAhUHlSwKHUIuCIIQjBAIHjAD&usg=AFQjCNGrvfhy3JIOHWUYB-YtyFV2A...
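The URLs above still carry Google's &sa=...&ved=...&usg=... suffix. One possible way to get rid of it is to treat each href as what it is, a /url path with a query string, and read the q parameter with the standard library instead of a regex. The following is only a sketch under a few assumptions: it targets Python 3 (urllib.parse), the helper name extract_targets is made up for illustration, and the sample hrefs are reconstructed from the question's output with placeholder ved/usg values rather than captured from a live page.

```python
from urllib.parse import urlparse, parse_qs

def extract_targets(hrefs):
    """Return the target URLs found in Google-style '/url?q=...' hrefs.

    Hypothetical helper for illustration; cached copies are skipped.
    """
    targets = []
    for href in hrefs:
        parsed = urlparse(href)
        if parsed.path != "/url":
            continue  # not a result redirect (e.g. pagination links)
        q_values = parse_qs(parsed.query).get("q")
        if not q_values:
            continue
        target = q_values[0]  # parse_qs has already percent-decoded this once
        if urlparse(target).netloc == "webcache.googleusercontent.com":
            continue  # skip the cached copy
        targets.append(target)
    return targets

# Sample hrefs with placeholder tracking values (assumed, not real tokens).
sample_hrefs = [
    "/url?q=http://euw.leagueoflegends.com/de&sa=U&ved=abc&usg=xyz",
    "/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:as12jwqcnbAJ"
    ":http://euw.leagueoflegends.com/de%252Blol%26hl%3Dde%26ct%3Dclnk&sa=U&ved=abc&usg=xyz",
    "/search?q=lol&start=10",
]

print(extract_targets(sample_hrefs))
```

With BeautifulSoup, the input list could be built as [a["href"] for a in soup.find_all("a", href=True)]. Filtering on the host name rather than on the substring webcache avoids dropping a legitimate result that merely has that word in its path.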

Upvotes: 3
