I am trying to scrape Google search results with the code below. I want to extract the title and URL of each result on the first page, and then continue with the following pages of search results. This is the code I have written so far:
import urllib.request
from bs4 import BeautifulSoup as soup

paging_url = "https://www.google.gr/search?q=donald+trump&ei=F91FW8XBGYjJsQHQwaWADA&start=110&sa=N&biw=811&bih=662"

# Send a custom User-Agent header with the request
req = urllib.request.Request(paging_url, headers={"User-Agent": "Magic Browser"})

# Download the page
UClient = urllib.request.urlopen(req)
page_html = UClient.read()
UClient.close()

# Parse the downloaded HTML
page_soup = soup(page_html, "html.parser")
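For the paging part, this is a minimal sketch of how the URL for each results page could be built. It assumes that the `start` query parameter is a zero-based result offset that grows by 10 per page, which is only inferred from the URL above (`start=110`). The helper name `build_search_url` and the `per_page` parameter are hypothetical, and the other query parameters (`ei`, `sa`, `biw`, `bih`) are left out on the assumption that they are optional.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.google.gr/search"


def build_search_url(query, page, per_page=10):
    """Build the search URL for a zero-based page index.

    Assumption: `start` is a result offset, so page 0 -> start=0,
    page 1 -> start=10, and so on.
    """
    params = {"q": query, "start": page * per_page}
    # urlencode escapes the query string; spaces become "+"
    return BASE_URL + "?" + urlencode(params)


# URLs for the first three result pages
urls = [build_search_url("donald trump", page) for page in range(3)]
for url in urls:
    print(url)
```

Each of these URLs could then be passed to `urllib.request.Request` in place of `paging_url`.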
I noticed that all Google results share a common class named "g", so I wrote the following command:
results = page_soup.findAll("div", {"class": "g"})
But when I test this, the results I get are not the same as the ones I see when I visit the initial URL.
Moreover, some div tags such as:
<div data-hveid="38" data-ved="0ahUKEwjGp7XEj5fcAhXMDZoKHRf8DJMQFQgmKAAwAA">
and
<div class="rc">
cannot be found in the tree that BeautifulSoup produces. This means I cannot use the findAll function to locate elements inside those tags, because BeautifulSoup acts as if they do not exist. Why does this happen?
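One check that might narrow this down is to look for the missing markup in the raw downloaded bytes, before BeautifulSoup parses anything. If the marker is absent there too, the server sent different HTML than the browser shows, and the parser is not the one dropping it. This is only a diagnostic sketch: the helper name `marker_in_raw_html` is hypothetical, and the `sample` bytes stand in for `page_html`.

```python
def marker_in_raw_html(raw_html, marker):
    """Return True if `marker` occurs in the downloaded bytes, before parsing."""
    # Decode leniently so unexpected bytes do not raise an error
    text = raw_html.decode("utf-8", errors="replace")
    return marker in text


# Stand-in for page_html; in the real script, pass page_html instead
sample = b'<div class="g"><h3>Some title</h3></div>'

print(marker_in_raw_html(sample, 'class="g"'))
print(marker_in_raw_html(sample, 'class="rc"'))
```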