Reputation: 251
I've taught myself web scraping and would like to scrape data from the Google daily search trends page here: https://trends.google.com/trends/trendingsearches/daily?geo=US The data would include the search keywords, their ranks, and their search frequencies on a daily basis.
I first tried scraping with R using the rvest library, but the extraction command returned empty data. I guess the HTML structure of the website is too complex for rvest to handle? So I'd like to learn a better approach that works for this website.
I searched for information specific to scraping the daily searches but couldn't find any, as most posts were concerned with extracting Google Trends data rather than the daily searches.
What would be an effective way to extract data from this website or, more generally, this kind of website? I'm happy to learn tools other than R and have basic knowledge of Python and JavaScript. If anyone could give me a hint, I will dig into it, but at the moment I have no idea even where to start.
Thanks,
Upvotes: 1
Views: 4150
Reputation: 1750
Have a look at the HTML using the 'Inspect Element' tool in Firefox.
Essentially, every element you want to scrape from the webpage can be distinguished easily using the tooltip the inspector shows.
Given that, we can use Selenium to scrape the webpage and retrieve this information.
(Install it first with pip3 install -U selenium
and install your favorite webdriver from the links here.)
Start a browser and direct it to the Google Trends page using something similar to:
$ ipython3
Python 3.7.0 (default, Jun 29 2018, 20:13:13)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.5.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: from selenium import webdriver
In [2]: browser = webdriver.Firefox()
...: browser.get('https://trends.google.com/trends/trendingsearches/daily?geo=US')
You should now see the trends page loaded in the browser window.
Again, using the Inspect Element tool, get the class of the div that contains every element to scrape.
We need to find the div with the class feed-list-wrapper.
In [3]: list_div = browser.find_element_by_class_name("feed-list-wrapper")
In [4]: list_div
Out[4]: <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="b889702e-7e2b-7448-9180-c9fb3d1ff641", element="cad96530-3444-9d4f-a8e8-b7da780f5751")>
Once done, get the list of divs with the class details inside it:
In [5]: details_divs = list_div.find_elements_by_class_name("details")
Then, for example, get the titles (the code should be self-explanatory by now):
In [6]: for detail_div in details_divs:
...: print(detail_div.find_element_by_class_name("details-top").find_element_by_xpath("div/span/a").text)
...:
Captain Marvel
Celia Barquin Arozamena
Yom Kippur
Lethal White
National Cheeseburger Day 2018
Ind vs HK
Mario Kart
Barcelona
Emilia Clarke
Elementary
Angela Bassett
Lenny Kravitz
Lil Uzi Vert
Handmaid's Tale
Mary Poppins Returns trailer
Hannah Gadsby
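Note that the find_element_by_* helpers used above were removed in Selenium 4.3, so on a current install those calls raise AttributeError. Below is a minimal sketch of the same title lookup with the find_element(by, value) form, plus an explicit wait because the list is filled in by JavaScript after the page loads. The function names extract_titles and fetch_trend_titles are my own, and the class names are the ones from this answer, so they are assumptions about markup that may have changed since.

```python
def extract_titles(list_div):
    """Return the title text of every trend found under `list_div`.

    `list_div` is any object with Selenium's find_element/find_elements
    methods (a WebElement or a WebDriver). The strategy strings below are
    the values of By.CLASS_NAME ("class name") and By.XPATH ("xpath").
    """
    titles = []
    for detail_div in list_div.find_elements("class name", "details"):
        top = detail_div.find_element("class name", "details-top")
        titles.append(top.find_element("xpath", "div/span/a").text)
    return titles


def fetch_trend_titles(url, timeout=10):
    """Open `url` in Firefox and return the trend titles on the page."""
    # Imported here so extract_titles() can be reused without a browser
    # driver being available.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    browser = webdriver.Firefox()
    try:
        browser.get(url)
        # Wait until the JavaScript-rendered list container exists.
        list_div = WebDriverWait(browser, timeout).until(
            EC.presence_of_element_located((By.CLASS_NAME, "feed-list-wrapper"))
        )
        return extract_titles(list_div)
    finally:
        # Always close the browser, even if the wait times out.
        browser.quit()
```

Calling fetch_trend_titles with the URL from the question should give the same kind of list as the loop above, provided the page still uses these class names.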
Another example, to get the search count next to each title:
In [7]: for detail_div in details_divs:
...: title = detail_div.find_element_by_class_name("details-top").find_element_by_xpath("div/span/a").text
...: search_count = detail_div.find_element_by_xpath('..').find_element_by_class_name("search-count-title").text
...: print("Title : {title} \t\t\t Searchs : {search_count}".format(title=title, search_count=search_count))
...:
Title : Captain Marvel Searchs : 500 k+
Title : Celia Barquin Arozamena Searchs : 200 k+
Title : Yom Kippur Searchs : 100 k+
Title : Lethal White Searchs : 50 k+
Title : National Cheeseburger Day 2018 Searchs : 50 k+
Title : Ind vs HK Searchs : 50 k+
Title : Mario Kart Searchs : 50 k+
Title : Barcelona Searchs : 50 k+
Title : Emilia Clarke Searchs : 50 k+
Title : Elementary Searchs : 20 k+
Title : Angela Bassett Searchs : 20 k+
Title : Lenny Kravitz Searchs : 20 k+
Title : Lil Uzi Vert Searchs : 20 k+
Title : Handmaid's Tale Searchs : 20 k+
Title : Mary Poppins Returns trailer Searchs : 20 k+
Title : Hannah Gadsby Searchs : 20 k+
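The counts come back as labels such as "500 k+", which are lower bounds rather than exact numbers. If you want to sort or compare them, a small helper can turn a label into an integer. This is only a sketch: parse_search_count is a name I made up, and handling of an "M" suffix is an assumption, since only "k" labels appear in the output above.

```python
import re

# Multipliers for the suffixes this sketch understands.
# Only "k" is seen in the output above; "m" is an assumption.
_MULTIPLIERS = {"": 1, "k": 1_000, "m": 1_000_000}

_LABEL_RE = re.compile(r"\s*(\d+(?:\.\d+)?)\s*([kKmM]?)\s*\+?\s*")


def parse_search_count(label):
    """Convert a label such as '500 k+' to its lower-bound integer value."""
    match = _LABEL_RE.fullmatch(label)
    if match is None:
        raise ValueError("unrecognised search count: {!r}".format(label))
    number, suffix = match.groups()
    return int(float(number) * _MULTIPLIERS[suffix.lower()])
```

Raising on an unknown label, instead of returning 0, makes it obvious when the page starts using a format the pattern does not cover.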
You should get used to Selenium quickly. If you have any doubts about the methods used here, here is a link to the Selenium docs.
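Since the question asks for keywords, ranks and frequencies on a daily basis, one possible way to keep the results is to append each day's rows to a CSV file. The sketch below assumes the rank is simply the position of the item in the scraped list (the page does not show an explicit rank), and append_daily_rows is a hypothetical helper name.

```python
import csv
import datetime


def append_daily_rows(out, day, rows):
    """Write one CSV line per trend: date, rank, title, search-count label.

    `out` is a text file object opened for appending, `day` is a
    datetime.date, and `rows` is a list of (title, count_label) pairs in
    the order they were scraped. Rank is assumed to be that order.
    """
    writer = csv.writer(out)
    for rank, (title, count_label) in enumerate(rows, start=1):
        writer.writerow([day.isoformat(), rank, title, count_label])


def save_today(path, rows):
    """Append today's rows to the CSV file at `path`."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "a", newline="", encoding="utf-8") as out:
        append_daily_rows(out, datetime.date.today(), rows)
```

You would build rows from the title and search_count values collected in the loop above, then call save_today once per day, for example from a cron job.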
Upvotes: 7