Reputation: 2127
I'm learning to scrape data from websites. I've been playing around with the rvest package and have gotten the hang of extracting nodes with SelectorGadget, etc. For a quick project I'm looking to extract data from a flight-deals website and turn it into a data frame that I can later subset and have emailed to me with the flights that are useful. Anyhow, the code I'm working with is below.
library(rvest)

reg = "http://www.secretflying.com/usa-deals/"

# read the text of each flight-deal headline -----------
fly_deals = read_html(reg)
fly_deals = html_nodes(fly_deals, ".entry-title a")
fly_deals = html_text(fly_deals)
fly_deals = as.data.frame(fly_deals)

# add link (not sure how to access the link)
fly_deals$corresponding_link = 'corresponding_link'

# last step would be to filter for NEW YORK deals
fly_deals = fly_deals[grepl("NEW YORK", fly_deals$fly_deals), ]
What I'd like to do now is access the page associated with each row (i.e., each node) so that I can build another column holding the corresponding link, which I could then open straight from my email. The final product would be a data frame with the deal headline in one column and its link in a second column.
appreciate any help!
Upvotes: 1
Views: 370
Reputation: 13274
Try:
library(rvest)
deals_link <- "http://www.secretflying.com/usa-deals/"

deals_info <- deals_link %>%
  read_html() %>%
  html_nodes(".entry-title a")

# pull both the headline text and the href from the same nodes
fly_deals <- data.frame(deals = html_text(deals_info),
                        corresponding_link = html_attr(deals_info, "href"),
                        stringsAsFactors = FALSE)
fly_deals[grepl("NEW YORK", fly_deals$deals), ]
Output:
deals
NON-STOP FROM NEW YORK TO CARTAGENA, COLOMBIA FOR ONLY $328 ROUNDTRIP
XMAS & NEW YEAR: NEW YORK TO THE TURKS & CAICOS FOR ONLY $231 ROUNDTRIP
NEW YORK TO BOSTON (& VICE VERSA) FOR ONLY $66 ROUNDTRIP
corresponding_link
http://www.secretflying.com/2016/new-york-cartagena-colombia-296-roundtrip/
http://www.secretflying.com/2016/hot-new-york-turks-caicos-58-one-way/
http://www.secretflying.com/2016/new-york-boston-vice-versa-66-roundtrip/
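If you also want to pull details from the page behind each matched row, something along these lines should work; I haven't verified the ".entry-content p" selector against the site, so treat it as a guess and check it with SelectorGadget:

# follow each matched link and collapse the post body into one string
nyc_deals <- fly_deals[grepl("NEW YORK", fly_deals$deals), ]

nyc_deals$details <- vapply(
  nyc_deals$corresponding_link,
  function(link) {
    page <- read_html(link)
    paste(html_text(html_nodes(page, ".entry-content p")), collapse = " ")
  },
  character(1)
)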
I hope this helps.
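Since you also mentioned wanting the filtered deals emailed to you, here is a rough sketch of that last step using the blastula package; the addresses and credentials below are placeholders, so swap in your own:

library(blastula)

nyc_deals <- fly_deals[grepl("NEW YORK", fly_deals$deals), ]

# one markdown bullet per deal: headline followed by its link
deal_lines <- paste0("- ", nyc_deals$deals, "\n  ", nyc_deals$corresponding_link,
                     collapse = "\n")

email <- compose_email(body = md(paste0("Current NYC deals:\n\n", deal_lines)))

# SMTP details are placeholders -- replace with your own account
smtp_send(
  email,
  to = "you@example.com",
  from = "you@example.com",
  subject = "Secret Flying: New York deals",
  credentials = creds(user = "you@example.com", provider = "gmail")
)

Running that on a schedule (cron, or the taskscheduleR package on Windows) would give you the automated digest you described.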
Upvotes: 1