Reputation: 55
I'm new to web scraping and wanted to learn by scraping the Keurig website for fun, extracting information about some of the K-Cups for sale. My goal is to go to the K-Cups page, follow the link for every K-Cup, and extract details such as whether it is caffeinated, the roast color, and maybe the origin. I can tackle the extraction later; for now I'm having trouble finding the right CSS selector and a way to automate visiting every item to get the extra info. This is what I tried:
library(rvest)
keurig <- read_html("http://www.keurig.com/beverages/k-cup-pods")
# Grab the CSS Nodes from the website
keurig.html <- html_nodes(keurig, ".keurig_card")
keurig.text <- html_text(keurig.html)
# Print the text
keurig.text
The result was a lot of tab and newline characters with some of the coffee names in between. How would I scrape this page to get the info about every K-Cup?
Upvotes: 0
Views: 974
Reputation: 9313
Use this to get the links for every item:
library(rvest)
keurig <- read_html("http://www.keurig.com/beverages/k-cup-pods")
keurig.html <- html_nodes(keurig, ".product_name")
links <- html_attr(keurig.html, name = "href")
The class that contains the links to every item is product_name.
Once you get the nodes, extract the href attribute.
Result (first four shown):
[1] "/Beverages/Coffee/Regular/Breakfast-Blend-Coffee/p/Breakfast-Blend-Coffee-K-Cup-Green-Mountain"
[2] "/Beverages/Coffee/Regular/Dark-Magic%C2%AE-Extra-Bold-Coffee/p/Dark-Magic-Extra-Bold-Coffee-K-Cup-Green-Mountain"
[3] "/Beverages/Coffee/Regular/The-Original-Donut-Shop%C2%AE-Coffee/p/Original-Donut-Shop-Extra-Bold-Coffee-K-Cup-CP"
[4] "/Beverages/Coffee/Regular/Nantucket-Blend%C2%AE-Coffee/p/Nantucket-Blend-Coffee-K-Cup-Green-Mountain"
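Then use paste0 to build the full URL of each item's details page. The href values are root-relative (they start with a slash), so prepend only the site root rather than the listing page URL:
paste0("http://www.keurig.com",
"/Beverages/Coffee/Regular/Breakfast-Blend-Coffee/p/Breakfast-Blend-Coffee-K-Cup-Green-Mountain")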
Upvotes: 1