Reputation: 783
I'm thinking there has to be a simple answer here, but I can't seem to find it.
I am scraping various web pages and I want to pull down all links from the web page. I am using htmlParse to do this and am about 95% of the way there, but need some assistance.
This is my code to grab the web page
library(XML)

MyURL <- "http://stackoverflow.com/"
MyPage <- htmlParse(MyURL) # Parse the web page
URLroot <- xmlRoot(MyPage) # Get root node
Once I have the root node, I can run this to get all the <a> nodes
URL_Links <- xpathSApply(URLroot, "//a") # get all <a> nodes from root
which gives me output like this
[[724]]
<a href="//area51.stackexchange.com" title="proposing new sites in the Stack Exchange network">Area 51</a>
[[725]]
<a href="//careers.stackoverflow.com">Stack Overflow Careers</a>
[[726]]
<a href="http://creativecommons.org/licenses/by-sa/3.0/" rel="license">cc by-sa 3.0</a>
Alternatively, I can run this
URL_Links_values = xpathSApply(URLroot, "//a", xmlGetAttr, "href") # Get all href values
which gets just the HREF values like this
[[721]]
[1] "http://creativecommons.org/licenses/by-sa/3.0/"
[[722]]
[1] "http://blog.stackoverflow.com/2009/06/attribution-required/"
However, what I am looking for is a way to get both the HREF value and the name of the link easily, preferably loaded into a data frame or matrix, so that instead of getting this returned
<a href="http://creativecommons.org/licenses/by-sa/3.0/" rel="license">cc by-sa 3.0</a>
<a href="http://blog.stackoverflow.com/2009/06/attribution-required/" rel="license">attribution required</a>
I get this
Name HREF
1 cc by-sa 3.0 http://creativecommons.org/licenses/by-sa/3.0/
2 attribution required http://blog.stackoverflow.com/2009/06/attribution-required/
Now I could take the output of URL_Links and do some regex or split the strings apart to get this data, but it just seems like there should be a simpler way to do this using the XML package.
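For illustration, here is a rough sketch of that string-splitting approach (the regexes are my own assumption and are fragile; they expect every link to have a double-quoted href and will misbehave otherwise):
## convert each <a> node back to a string, then pull the pieces out with regexes
link_strings <- sapply(URL_Links, saveXML)
hrefs <- gsub('.*href="([^"]*)".*', "\\1", link_strings)  # text inside href="..."
names <- gsub("<[^>]*>", "", link_strings)                # strip the tags, keep the text
df <- data.frame(Name = names, HREF = hrefs, stringsAsFactors = FALSE)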
Is there an easy way to do what I am looking to do?
Edit:
Just figured out I can do this to get the URL names
URL_Links_names <- xpathSApply(URLroot, "//a", xmlValue) # Get all link names
However when I run this
df <- data.frame(URL_Links_names, URL_Links_values)
I get this error
Error in data.frame("//stackoverflow.com", "http://chat.stackoverflow.com", : arguments imply differing number of rows: 1, 0
I'm guessing there are links with no name, so how do I get that to return "" or NA for any links that aren't named?
Upvotes: 0
Views: 288
Reputation: 99331
There seem to be a couple of links in the html with a missing href attribute. Because xmlGetAttr() returns NULL when the requested attribute is not there, you can find those with is.null(). Then you can put that into an if() condition to include an empty character string for the ones that are missing, and the href attribute otherwise. There is no need to subset the root node.
library(XML)
## parse the html document
doc <- htmlParse("http://stackoverflow.com/")
## use the [.XMLNode accessor to drop into 'a' and then apply our functions
getvals <- lapply(doc["//a"], function(x) {
    data.frame(
        ## get the xml value
        Name = xmlValue(x, trim = TRUE),
        ## get the href link if it exists
        HREF = if (is.null(att <- xmlGetAttr(x, "href"))) "" else att,
        stringsAsFactors = FALSE
    )
})
## create the full data frame
df <- do.call(rbind, getvals)
## have a look
str(df)
# 'data.frame': 697 obs. of 2 variables:
# $ Name: chr "current community" "chat" "Stack Overflow" "Meta Stack Overflow" ...
# $ HREF: chr "//stackoverflow.com" "http://chat.stackoverflow.com" "//stackoverflow.com" "http://meta.stackoverflow.com" ...
tail(df)
# Name HREF
# 692 Stack Apps //stackapps.com
# 693 Meta Stack Exchange //meta.stackexchange.com
# 694 Area 51 //area51.stackexchange.com
# 695 Stack Overflow Careers //careers.stackoverflow.com
# 696 cc by-sa 3.0 http://creativecommons.org/licenses/by-sa/3.0/
# 697 attribution required http://blog.stackoverflow.com/2009/06/attribution-required/
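If you would rather have NA than an empty string for the links without an href (as the edit to the question suggests), the same pattern works with NA_character_ swapped in:
## same idea, but return NA instead of "" when the href is missing
getvals_na <- lapply(doc["//a"], function(x) {
    data.frame(
        Name = xmlValue(x, trim = TRUE),
        HREF = if (is.null(att <- xmlGetAttr(x, "href"))) NA_character_ else att,
        stringsAsFactors = FALSE
    )
})
df_na <- do.call(rbind, getvals_na)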
Upvotes: 1
Reputation: 783
My goal was to look at all the link names and then determine which URL I needed. I didn't find a way to get the single data frame I wanted with everything in it, but what I can do is get all the link names like this
library(XML)

MyURL <- "http://stackoverflow.com/"
MyPage <- htmlParse(MyURL) # Parse the web page
URLroot <- xmlRoot(MyPage) # Get root node
URL_Links_names <- xpathSApply(URLroot, "//a", xmlValue) # Get all link names
That gets me all of the link names. Search through the names and determine whether you want some or all of them, and then you can pass a link name to this function to get the HREF value of that link:
GetLinkURLByName <- function(LinkName, WebPageURL) {
    LinkURL <- getHTMLLinks(WebPageURL, xpQuery = sprintf("//a[text()='%s']/@href", LinkName))
    return(LinkURL)
}
LinkName is the name of the link from URL_Links_names. WebPageURL is the web page you are scraping (in this example I would pass it MyURL).
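For example (the exact output depends on what the live page contains when you scrape it):
GetLinkURLByName("cc by-sa 3.0", MyURL)
# [1] "http://creativecommons.org/licenses/by-sa/3.0/"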
Upvotes: 0