Jiang Liang

Reputation: 376

How to scrape a hyperlink with R and keep the hyperlink clickable in the output file?

I am an R beginner trying to look through the top 500/1000 most-voted/most-frequent questions on Stack Overflow.

I need a data.frame with two variables that contain the question and the hyperlink of this question, respectively.

The webpage is here, and I need hyperlinks like this:

        <h3><a href="/questions/5963269/how-to-make-a-great-r-reproducible-example" class="question-hyperlink">How to make a great R reproducible example</a></h3>

The output should look like this:

         question                                    link
    1    how-to-make-a-great-r-reproducible-example  <questions/5963269/how-to-make-a-great-r-reproducible-example>
    2 ...
    3 ...

Sorry for my confusing question: I would like a hyperlink that opens the webpage when I click it. So I build the full URL with

web <- data.frame(sapply(answer$link, function(x) {paste("https://stackoverflow.com",x, sep = "")}))

or

web <- data.frame(sapply(df[2], function(x) {paste("https://stackoverflow.com",x, sep = "")}))

web[1]
https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example

I want a clickable link like "here" (How to make a great R reproducible example).

I wrote this to a txt or CSV file, but the links do not open the page when I click them.

Could you improve it? Thanks again.

@DaveT @QHarr

Upvotes: 0

Views: 1524

Answers (2)

QHarr

Reputation: 84465

You should really use the API that Stack Exchange provides. However, you can also do this with a single-level CSS selector that gathers the a tags by their class attribute, then use tidyverse functions to pull out the text and the href of each node and collect them in a tibble:

library(tidyverse)
library(rvest)

# Select every element whose class attribute is exactly "question-hyperlink"
nodes <- read_html('https://stackoverflow.com/questions/tagged/r?tab=votes&page=1&pagesize=50') %>%
  html_nodes("[class=question-hyperlink]")

# Build one row per node: the link text and the absolute URL
df <- map_df(nodes, ~{
  questions <- .x %>% html_text()
  links <- paste0('https://stackoverflow.com', .x %>% html_attr("href"))
  tibble(questions, links)
})
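For the API route mentioned at the top of this answer, here is a rough sketch. The endpoint version (2.3), the query parameters, and the `title`/`link` fields are assumptions based on the public Stack Exchange API, and `build_api_url` and `fetch_top_questions` are hypothetical helper names, so check the API documentation before relying on it. It assumes the jsonlite package is installed.

```r
# Build the request URL for the top-voted questions in a tag.
# Endpoint and parameter names are assumptions (Stack Exchange API 2.3).
build_api_url <- function(tag = "r", page = 1, pagesize = 50) {
  paste0(
    "https://api.stackexchange.com/2.3/questions",
    "?order=desc&sort=votes&site=stackoverflow",
    "&tagged=", tag,
    "&page=", page,
    "&pagesize=", pagesize
  )
}

# Hypothetical helper: fetch one page and keep only the title and the link.
# Assumes the response has an `items` element with `title` and `link` fields.
fetch_top_questions <- function(tag = "r", page = 1, pagesize = 50) {
  res <- jsonlite::fromJSON(build_api_url(tag, page, pagesize))
  data.frame(
    question = res$items$title,
    link = res$items$link,
    stringsAsFactors = FALSE
  )
}

# Not run here because it needs network access:
# top50 <- fetch_top_questions("r")
```

Looping `page` over several values would cover the top 500 or 1000 questions, subject to whatever rate limits the API applies.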


Upvotes: 2

Dave2e

Reputation: 24089

This is a straightforward problem using the rvest package. The principle is to read the page, extract the desired nodes using CSS selectors, and then extract the requested information from those nodes.
The tricky part is isolating the links associated with the questions from all the other links on the page. In this case I needed 3-4 levels of CSS selectors to separate them completely.
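To illustrate why the longer selector matters, here is a small offline sketch. The HTML fragment is made up for the demonstration (it only imitates the nesting of the real page, and the `sidebar` block is hypothetical), so it runs without touching the site.

```r
library(rvest)

# Made-up fragment: one question link, plus an unrelated link that also
# sits inside an <h3> somewhere else on the page.
html <- '
<div class="question-summary">
  <div class="summary">
    <h3><a href="/questions/1/example-question" class="question-hyperlink">Example question</a></h3>
  </div>
</div>
<div class="sidebar">
  <h3><a href="/help">Help</a></h3>
</div>'

demo_page <- read_html(html)

# A short selector matches both links
all_links <- html_nodes(demo_page, "h3 a")
length(all_links)

# The multi-level selector keeps only the link nested inside
# div.question-summary and div.summary
question_links <- html_nodes(demo_page, "div.question-summary div.summary h3 a")
html_text(question_links)
html_attr(question_links, "href")
```

Each extra level narrows the match, which is the same idea applied to the real page in the code that follows.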

See the comments in the code for the step by step instructions.

library(rvest)

url<-"https://stackoverflow.com/questions/tagged/r?tab=votes&page=1&pagesize=50"

#read the page
page<-read_html(url)

#get hyperlink nodes
#the 'a' tag under a 'h3' tag under 'div' tag of class 'summary' under a 'div' tag of class 'question-summary'
nodes<-html_nodes(page, "div.question-summary div.summary h3 a")

#Get text
question<-html_text(nodes)
#get link
link<-paste0("https://stackoverflow.com", html_attr(nodes, "href"))

answer<-data.frame(question, link)
head(answer)
                                                         question                                                                                                      link
1                       How to make a great R reproducible example                    https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
2                    How to sort a dataframe by multiple column(s)                   https://stackoverflow.com/questions/1296646/how-to-sort-a-dataframe-by-multiple-columns
3      How to join (merge) data frames (inner, outer, left, right)          https://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right
4 Grouping functions (tapply, by, aggregate) and the *apply family   https://stackoverflow.com/questions/3505701/grouping-functions-tapply-by-aggregate-and-the-apply-family
5                                  Drop data frame columns by name                               https://stackoverflow.com/questions/4605206/drop-data-frame-columns-by-name
6  Remove rows with all or some NAs (missing values) in data.frame https://stackoverflow.com/questions/4862178/remove-rows-with-all-or-some-nas-missing-values-in-data-frame
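On the clickable part of the question: a txt or CSV file is plain text, so whether a URL is clickable depends on the program that opens it. Two possible workarounds, sketched with base R only, are to write an HTML file with `<a>` tags, or to write the link as a spreadsheet `HYPERLINK` formula. The one-row `answer` data frame below is a stand-in for the scraped result, `escape_html` is a hypothetical helper, and the file names are arbitrary. Whether a spreadsheet evaluates formulas when importing a CSV depends on the application and its import settings.

```r
# Stand-in for the scraped data frame (same column names as `answer` above)
answer <- data.frame(
  question = "How to make a great R reproducible example",
  link = "https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example",
  stringsAsFactors = FALSE
)

# Hypothetical helper: escape characters that are special in HTML
escape_html <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  x <- gsub(">", "&gt;", x, fixed = TRUE)
  gsub('"', "&quot;", x, fixed = TRUE)
}

# Option 1: an HTML file with one <a> tag per question
anchors <- sprintf(
  '<li><a href="%s">%s</a></li>',
  escape_html(answer$link), escape_html(answer$question)
)
html_file <- file.path(tempdir(), "questions.html")
writeLines(
  c("<!DOCTYPE html>", "<html><body><ol>", anchors, "</ol></body></html>"),
  html_file
)

# Option 2: a CSV whose link column holds a spreadsheet HYPERLINK formula.
# Double quotes inside the formula's label are doubled.
label <- gsub('"', '""', answer$question, fixed = TRUE)
formula <- sprintf('=HYPERLINK("%s","%s")', answer$link, label)
csv_file <- file.path(tempdir(), "questions.csv")
write.csv(
  data.frame(question = answer$question, link = formula),
  csv_file,
  row.names = FALSE
)
```

Opening `questions.html` in a browser gives a numbered list of clickable question titles; the CSV variant is only useful if the file ends up in a spreadsheet program.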

Upvotes: 2
