Reputation: 376
I am an R beginner trying to look through the top 500/1000 most-voted or most-frequent questions on Stack Overflow.
I need a data.frame with two variables that contain the question and the hyperlink of this question, respectively.
The webpage is here, and I need hyperlinks like this:
<h3><a href="/questions/5963269/how-to-make-a-great-r-reproducible-example" class="question-hyperlink">How to make a great R reproducible example</a></h3>
output like this:
question link
1 how-to-make-a-great-r-reproducible-example <questions/5963269/how-to-make-a-great-r-reproducible-example>
2 ...
3 ...
Sorry for my confusing question: I would like hyperlinks that open the webpage when clicked, so I build the full URL with
web <- data.frame(sapply(answer$link, function(x) {paste("https://stackoverflow.com",x, sep = "")}))
or
web <- data.frame(sapply(df[2], function(x) {paste("https://stackoverflow.com",x, sep = "")}))
web[1]
https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
like [here] (How to make a great R reproducible example)
I wrote the result to a txt or CSV file, but the links do not open the page when I click them.
Could you improve it? Thanks again.
@DaveT @QHarr
Upvotes: 0
Views: 1524
Reputation: 84465
You should really use the API that Stack Exchange provides. However, you can do this with a single-level CSS selector that gathers the a tags by class attribute, then separate out the text and the href with tidyverse functionality, and perhaps generate a tibble:
library(tidyverse)
library(rvest)
nodes <- read_html('https://stackoverflow.com/questions/tagged/r?tab=votes&page=1&pagesize=50') %>%
  html_nodes("[class=question-hyperlink]")
df <- map_df(nodes, ~{
  questions <- .x %>% html_text()
  links <- paste0('https://stackoverflow.com', .x %>% html_attr("href"))
  tibble(questions, links)
})
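As a rough illustration of the API route mentioned above, the sketch below builds request URLs for the Stack Exchange API questions endpoint and turns a response into a two-column data frame. It assumes the jsonlite package is installed and API version 2.3. The helper names api_url and parse_questions are hypothetical, and the JSON near the end is a hand-written stand-in for a real response, trimmed to the two fields used.

```r
library(jsonlite)

# Hypothetical helper: URL for one page of R questions sorted by votes.
# The API caps pagesize at 100, so 10 pages would cover the top 1000.
api_url <- function(page, pagesize = 100) {
  sprintf(
    "https://api.stackexchange.com/2.3/questions?order=desc&sort=votes&tagged=r&site=stackoverflow&page=%d&pagesize=%d",
    page, pagesize
  )
}

# Hypothetical helper: keep only the title and link of each question.
# fromJSON() accepts either a JSON string or a URL.
parse_questions <- function(json_text) {
  items <- fromJSON(json_text)$items
  data.frame(question = items$title, link = items$link, stringsAsFactors = FALSE)
}

# Hand-written sample in the shape of an API response (not real output)
sample_json <- '{"items":[{"title":"How to make a great R reproducible example","link":"https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example"}],"has_more":true}'
df_api <- parse_questions(sample_json)

# Against the live API (needs network access; anonymous requests are rate-limited):
# pages  <- lapply(1:10, function(p) parse_questions(api_url(p)))
# df_api <- do.call(rbind, pages)
```

Titles returned by the API may contain HTML entities (for example &quot;), so they might need decoding before display.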
Upvotes: 2
Reputation: 24089
This is a straightforward problem using the rvest package. The principle is to read the page, extract the desired nodes using CSS selectors, and then extract the requested information.
The tricky part here is to isolate the links associated only with the questions and none of the others. In this case I needed 3-4 levels of CSS selectors to separate them completely.
See the comments in the code for the step by step instructions.
library(rvest)
url<-"https://stackoverflow.com/questions/tagged/r?tab=votes&page=1&pagesize=50"
#read the page
page<-read_html(url)
#get hyperlink nodes
#the 'a' tag under a 'h3' tag under 'div' tag of class 'summary' under a 'div' tag of class 'question-summary'
nodes<-html_nodes(page, "div.question-summary div.summary h3 a")
#Get text
question<-html_text(nodes)
#get link
link<-paste0("https://stackoverflow.com", html_attr(nodes, "href"))
answer<-data.frame(question, link)
head(answer)
question link
1 How to make a great R reproducible example https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
2 How to sort a dataframe by multiple column(s) https://stackoverflow.com/questions/1296646/how-to-sort-a-dataframe-by-multiple-columns
3 How to join (merge) data frames (inner, outer, left, right) https://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right
4 Grouping functions (tapply, by, aggregate) and the *apply family https://stackoverflow.com/questions/3505701/grouping-functions-tapply-by-aggregate-and-the-apply-family
5 Drop data frame columns by name https://stackoverflow.com/questions/4605206/drop-data-frame-columns-by-name
6 Remove rows with all or some NAs (missing values) in data.frame https://stackoverflow.com/questions/4862178/remove-rows-with-all-or-some-nas-missing-values-in-data-frame
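On the follow-up about clickable links: a plain txt or CSV file stores only text, so whether a URL is clickable depends on the program that opens the file. One option, sketched here on the assumption that an HTML file is acceptable, is to write the table as an HTML list with anchor tags. The helpers escape_html and make_link_items are made up for illustration, the small data frame stands in for the scraped answer object, and only base R is used.

```r
# Sample rows standing in for the scraped `answer` data frame (full URLs assumed)
answer <- data.frame(
  question = c("How to make a great R reproducible example",
               "Drop data frame columns by name"),
  link = c("https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example",
           "https://stackoverflow.com/questions/4605206/drop-data-frame-columns-by-name"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: escape characters that are special in HTML
escape_html <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  x <- gsub(">", "&gt;", x, fixed = TRUE)
  gsub('"', "&quot;", x, fixed = TRUE)
}

# Hypothetical helper: one <li><a href="...">title</a></li> string per row
make_link_items <- function(df) {
  sprintf('<li><a href="%s">%s</a></li>',
          escape_html(df$link), escape_html(df$question))
}

# Assemble a minimal page and write it to a temporary file
html <- c("<!DOCTYPE html>", "<html><body><ol>",
          make_link_items(answer),
          "</ol></body></html>")
out_file <- file.path(tempdir(), "questions.html")
writeLines(html, out_file)

# browseURL(out_file)  # uncomment to open the page in the default browser
```

Opening questions.html in a browser should show a numbered list in which each question title links to its page.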
Upvotes: 2