user3203275

Reputation: 305

Scrape link in R

I am working on a project in R. I want to extract the link aosmith.com as it appears on the Wikipedia page https://en.wikipedia.org/wiki/A._O._Smith. My question may have been asked before, but I haven't managed to find a solution yet. What I have tried so far is the following, without success:

library(rvest)
library(magrittr)

url <- "https://en.wikipedia.org/wiki/A._O._Smith"
links <- read_html(url) %>%
  html_nodes(".lister-item-header a") %>%
  html_attr("href")

Upvotes: 1

Views: 149

Answers (3)

Joshua Mire

Reputation: 736

This should work for any Wikipedia article assigned to url that has an infobox with a Website row, and will return only the desired URL:

library(rvest)
library(magrittr)

url <- "https://en.wikipedia.org/wiki/A._O._Smith"
link <- read_html(url) %>%
  html_nodes(".infobox") %>%
  html_nodes(".url > a") %>%
  html_attr("href")
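
If you need this for more than one article, the same selectors can be wrapped in a small helper. This is just a sketch under the assumption that the target article has an infobox with a Website (.url) row; get_infobox_url is an illustrative name, not part of the original answer.

get_infobox_url <- function(article_url) {
  # Grab the href of the link inside the infobox's "url" cell;
  # returns character(0) if the article has no such row.
  read_html(article_url) %>%
    html_nodes(".infobox .url > a") %>%
    html_attr("href")
}

get_infobox_url("https://en.wikipedia.org/wiki/A._O._Smith")
# should give the company homepage, e.g. "http://www.aosmith.com"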

Upvotes: 2

Gorka

Reputation: 2071

Using the inspector tool of the browser (F12 and Ctrl+Shift+C), you can copy the xpath of the link: click aosmith.com, then right-click the highlighted element in the panel and copy its xpath. In R, use the copied xpath to access the desired element.

link <- read_html(url) %>%
         html_nodes(xpath='//*[@id="mw-content-text"]/div/table/tbody/tr[19]/td/span/a') %>%
         html_attr(., "href")
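
Note that the tr[19] index in the copied xpath is tied to the current layout of the infobox table; if Wikipedia adds or removes a row, the expression stops matching and simply returns character(0), in which case re-copying the xpath from the inspector is the quickest fix.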


Upvotes: 1

Allan Cameron

Reputation: 173803

You get more control and generalisability by using a specific xpath expression. This xpath expression just searches for the link with the text "A.O. Smith". Compared to using numbered xpaths generated by the browser, this is less likely to break if/when the page is updated.

 library(rvest)
 library(magrittr)

 url  <- "https://en.wikipedia.org/wiki/A._O._Smith"
 link <- read_html(url) %>% 
         html_nodes(xpath = "//a[text() = 'A.O. Smith']") %>%
         html_attr("href")
 link
 #> [1] "http://www.aosmith.com"

Upvotes: 2
