Petris

Reputation: 145

How to scrape a Google search results page?

I am trying to scrape Google search results using the code below. I want to take the title and the URL of each result on the first page, and then continue by scraping the next pages of the search results as well. This is a sample of the code that I have just started writing:

from urllib.request import urlopen as uReq
import urllib.request
from bs4 import BeautifulSoup as soup


paging_url = "https://www.google.gr/search?q=donald+trump&ei=F91FW8XBGYjJsQHQwaWADA&start=110&sa=N&biw=662&bih=662"

# build the request with a custom User-Agent header
req = urllib.request.Request(paging_url, headers={'User-Agent': "Magic Browser"})

UClient = uReq(req)  # downloading the url
page_html = UClient.read()
UClient.close()

page_soup = soup(page_html, "html.parser")
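To continue with the next pages of results, my plan is to loop over the start parameter of the search URL, roughly like this (I am assuming that every page advances start by 10, and I have stripped the URL down to the q and start parameters just for illustration):

from urllib.request import urlopen as uReq
import urllib.request
from bs4 import BeautifulSoup as soup

base_url = "https://www.google.gr/search?q=donald+trump&start={}"

for start in range(0, 30, 10):  # first three result pages as a test
    req = urllib.request.Request(base_url.format(start),
                                 headers={'User-Agent': "Magic Browser"})
    UClient = uReq(req)  # download this page of results
    page_soup = soup(UClient.read(), "html.parser")
    UClient.close()
    # ... parse the results of this page here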

I noticed that all Google results share a common class named "g", so I wrote the following command:

results = page_soup.findAll("div", {"class": "g"})
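From each of those divs I was planning to pull out the title and the link, something like the snippet below (that every result div contains an h3 title and an a tag with the href is only my assumption about Google's markup):

for result in results:
    link = result.find("a")     # first link in the result block
    title = result.find("h3")   # result title, assumed to be an h3
    if link is not None and title is not None:
        print(title.get_text(), link.get("href"))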

But after testing, the results returned are not the same as the ones I see when I visit the initial URL in a browser.

Moreover, some div tags such as:

<div data-hveid="38" data-ved="0ahUKEwjGp7XEj5fcAhXMDZoKHRf8DJMQFQgmKAAwAA">

and

<div class="rc">

cannot be seen in the tree that BeautifulSoup produces. This means I cannot use the findAll function to locate elements inside those tags, because BeautifulSoup acts as if they do not exist. Why does all this happen?

Upvotes: 1

Views: 6834

Answers (0)
