Reputation: 154
I'm trying to search for a word on Google with Python, extract the result URLs into a list, and print that list. But I've run into this problem:
class search:
    def __init__(self, search):
        page = requests.get("http://www.google.de/search?q="+search)
        soup = BeautifulSoup(page.content)
        links = soup.findAll("a")
        for link in soup.find_all("a",href=re.compile("(?<=/url\?q=)(htt.*://.*)")):
            print re.split(":(?=http)",link["href"].replace("/url?q=",""))

search("lol")
This works. But look at the output:
['http://euw.leagueoflegends.com/de&sa=U&ved=0ahUKEwie3sWOkbHRAhVGGCwKHSChAWQQFggVMAA&usg=AFQjCNEkd1xB6jaSnzWz-VpYcnHvSNYMJA']
['http://webcache.googleusercontent.com/search%3Fq%3Dcache:as12jwqcnbAJ', 'http://euw.leagueoflegends.com/de%252Blol%26hl%3Dde%26ct%3Dclnk&sa=U&ved=0ahUKEwie3sWOkbHRAhVGGCwKHSCqewsfdvfgh1A&usg=AFQjCNEm132qewdasDq2hCb9SRjnbmbMb3rkw']
(and so on)
How do I put all of these results into a single list? And how can I remove the webcache entries?
I know the URLs are percent-encoded, but I can simply decode them with urllib2.
Thank you in advance!
Upvotes: 3
Views: 1719
Reputation: 54303
This should bring you closer. The links variable wasn't used in your code. The method now returns a single list and leaves out any URL containing webcache:
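from bs4 import BeautifulSoup
import requests
import re

class Google:
    @classmethod
    def search(cls, search):
        page = requests.get("http://www.google.de/search?q="+search)
        # Name the parser explicitly so bs4 does not have to guess one
        soup = BeautifulSoup(page.content, "html.parser")
        # Result links have the form /url?q=<target>&sa=...
        links = soup.find_all("a", href=re.compile(r"(?<=/url\?q=)(htt.*://.*)"))
        # Strip the /url?q= prefix and keep only the part before any embedded second URL
        urls = [re.split(":(?=http)", link["href"].replace("/url?q=", ""))[0] for link in links]
        # Drop Google cache entries
        return [url for url in urls if 'webcache' not in url]

print(Google.search("lol"))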
It outputs:
[u'http://euw.leagueoflegends.com/de&sa=U&ved=0ahUKEwixjpPMmrHRAhUHlSwKHUIuCIIQFggVMAA&usg=AFQjCNEkd1xB6jaSnzWz-VpYcnHvSNYMJA', u'http://euw.leagueoflegends.com/de/news/&sa=U&ved=0ahUKEwixjpPMmrHRAhUHlSwKHUIuCIIQjBAIHDAB&usg=AFQjCNGY7DvS4oNNQktCTf3FGtStOG9xvA', u'http://gameinfo.euw.leagueoflegends.com/de/game-info/&sa=U&ved=0ahUKEwixjpPMmrHRAhUHlSwKHUIuCIIQjBAIHjAD&usg=AFQjCNGrvfhy3JIOHWUYB-YtyFV2A...
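The URLs above still carry Google's tracking parameters (&sa=U&ved=...). One possible way to drop them is to treat each href as a URL and read its q query parameter instead of splitting with a regex. The following is only a sketch under some assumptions: it targets Python 3 (urllib.parse; in Python 2 the same functions live in the urlparse module), and the helper names extract_target and clean_results are made up for illustration. parse_qs also decodes the percent-encoding, so no separate unquoting step should be needed.

```python
from urllib.parse import urlparse, parse_qs


def extract_target(href):
    """Return the destination of a '/url?q=...' redirect href, or None."""
    parsed = urlparse(href)
    if parsed.path != "/url":
        return None  # not a result redirect link
    # parse_qs splits on '&' and percent-decodes each value
    targets = parse_qs(parsed.query).get("q")
    return targets[0] if targets else None


def clean_results(hrefs):
    """Keep only real result URLs, skipping Google cache entries."""
    urls = (extract_target(href) for href in hrefs)
    return [url for url in urls if url and "webcache" not in url]
```

In the search method this could replace the re.split line, for example clean_results(link["href"] for link in soup.find_all("a", href=True)). Note that a destination URL with its own unescaped & in the query string would be cut at that point, so this is a starting point rather than a complete solution.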
Upvotes: 3