Reputation: 5897
I am working with the R programming language. I scraped a page with rvest and have a node set of <li> elements that contains HTTP links (among other things). It looks something like this:
library(rvest)
library(httr)
library(XML)
url <- "mywebsite.com"
page <- read_html(url)
links1 <- page %>% html_nodes("li")
head(links1)
{xml_nodeset (393)}
[3] <li class="social-icon"><a class="tip-me" href="https://www.youtube.com/channel/UCYNT3iuUwsnEwelGScQ3k1A/videos" data-toggle="tooltip" data-animation="true" title= ...
[4] <li class="social-icon"><a class="tip-me" href="https://www.web222.ca" data- ...
[5] <li class="social-icon"><a class="tip-me" href="#" data-toggle="tooltip" data-animation="true" title=""><span class="icon-dribbble"></span></a></li>
[6] <li><a href="https://www.web777.ca/">Home</a></li>\n
[7] <li id="menu-item-17" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-home menu-item-17"><a href="https://www.web555.ca/" ...
[8] <li id="menu-item-2606" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-2606"><a href="https://www.web111.ca">L ...
[9] <li id="menu-item-18618" class="menu-item menu-item-type-custom menu-item-object-custom menu-item-has-children menu-item-18618">\n<a href="#">Local Listings</a>\n< ...
[10] <li id="menu-item-10758" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-10758"><a href="https://www.web123.ca/den ...
[11] <li id="menu-item-1227" class="menu-item menu-item-type-taxonomy menu-item-object-listings_categories menu-item-1227"><a href="https://www.web123.c ...
[12] <li id="menu-item-1226" class="menu-item menu-item-type-taxonomy menu-item-object-listings_categories menu-item-1226"><a href="https://www.web124.c ...
[13] <li id="menu-item-883" class=
I want to extract every URL contained in this list. I think the URLs are stored in the "href" attribute of the <a> element inside each <li>. I tried several ways to get them out of "links1", but in the end I only got it working with a different approach:
# source: https://www.geeksforgeeks.org/extract-all-the-urls-from-the-webpage-using-r-language/
# make the HTTP request
resource <- GET(url)
# parse the response into an HTML document
parse <- htmlParse(resource)
# extract the href attribute of every <a> element
links2 <- xpathSApply(parse, path="//a", xmlGetAttr, "href")
# print the links
print(links2)
My Question: I would have thought there is some way to extract the links directly from "links1", instead of approaching the problem with a different method as I did for "links2". Can someone please show me how to extract the URLs from "links1"?
Thanks!
Upvotes: 0
Views: 57
Reputation: 6563
Try this:
links1 <- page %>%
  html_nodes("li a") %>%
  html_attr("href")
html_attr() returns the href values as a character vector, so no further text-extraction step is needed.
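If you already have the <li> node set from the question and want to start from it rather than from the page, a minimal sketch could look like the following. It assumes rvest 1.0.0 or later, where minimal_html() is available. The inline HTML is made-up sample data standing in for the real page, and the variable name urls is only illustrative.

```r
library(rvest)

# Hypothetical stand-in for the real page, so the example runs offline
page <- minimal_html('
  <ul>
    <li class="social-icon"><a href="https://www.web222.ca">A</a></li>
    <li class="social-icon"><a href="#">B</a></li>
    <li><a href="https://www.web777.ca/">Home</a></li>
    <li>No link here</li>
  </ul>')

# Same starting point as in the question: a node set of <li> elements
links1 <- page %>% html_nodes("li")

# Descend from each <li> to the <a> elements inside it and read their href attribute
urls <- links1 %>%
  html_nodes("a") %>%
  html_attr("href")

# Optional: drop missing values and "#" placeholder anchors
urls <- urls[!is.na(urls) & urls != "#"]

print(urls)
```

Filtering out "#" is a choice made here because the sample output in the question contains such placeholder links; leave that line out if you want every href value as-is.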
Upvotes: 1