Reputation: 17
I need to extract a large number of XML sitemap elements from multiple XML files using rvest. I have been able to extract html_nodes from webpages using XPaths, but working with XML files is new to me.
I also can't find a Stack Overflow question that shows how to parse an XML file from its address, rather than parsing a large chunk of XML text.
Example of what I have used for HTML:
library(rvest)
webpage <- ""
data <- webpage %>%
  read_html() %>%
  html_nodes("any given node goes here")
How do I adapt this to extract the "loc" element from an XML file, read from its address, that looks like this:
Here is what I have changed in the script kindly provided by Dave:
library(xml2)
library(tidyr)
#list of files to process
fnames <- c("")
dfs <- lapply(fnames, function(fname) {
  doc <- read_xml(fname)
  #find loc and lastmod
  loc <- trimws(xml_text(xml_find_all(doc, ".//loc")))
  lastmod <- trimws(xml_text(xml_find_all(doc, ".//lastmod")))
  #find all of the nodes/records under the urlset node
  nodes <- xml_children(xml_find_all(doc, ".//urlset"))
  #find the sub nodes names and values
  nodenames <- xml_name(nodes)
  nodevalues <- trimws(xml_text(nodes))
  #make data frame of all the values
  data.frame(file=fname, loc=loc, lastmod=lastmod, node.names=nodenames,
             values=nodevalues, stringsAsFactors = FALSE)
})
#Make one long df
longdf <- do.call(rbind, dfs)
#make into a wide format
finalanswer <- spread(longdf, key=node.names, value=values)
Upvotes: 1
Views: 2050
Reputation: 65
I wrote this code some time ago to go through all the XML files in a directory and collect specific nodes that follow a given pattern. With a little tweaking you may be able to reuse some of it.
library(XML)
dir <- dir()
for(i in seq_along(dir)){
  visitNode <- function(node) {#Recursive function to visit the XML tree (depth first)
    if (is.null(node)) {#Nothing to visit. Turn back
      return(invisible(NULL))
    }
    print(paste("Node: ", xmlName(node)))
    num.children = xmlSize(node)
    if(num.children == 0 ) {# Add your code to process the leaf node here
      print( paste(" ", xmlValue(node)))
    }
    if (num.children > 0){#Go one level deeper
      for (i in 1 : num.children) {
        visitNode(node[[i]][["NFe"]]) #the "NFe" element of the i-th child of node
      }
    }
  }
  xmlfile <- dir[i]
  xtree <- xmlInternalTreeParse(xmlfile)
  root <- xmlRoot(xtree)
  dataxml <- visitNode(root)
  dataxml <- xmlToList(root)
  #collect the infCpl values into a data frame
  df <- data.frame(matrix(unlist(dataxml$NFe$infNFe$infAdic$infCpl), nrow=length(dataxml$NFe$infNFe$infAdic$infCpl), byrow=TRUE))
}
Upvotes: 0
Reputation: 24079
Since the number of children per url node varies, here is a working approach:
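library(xml2)
library(dplyr)
#find parent nodes (file is the document returned by read_xml())
parents <- xml_find_all(file, ".//url")
#parse each child
dfs <- lapply(parents, function(node) {
  #find all children
  nodes <- xml_children(node)
  #get node name and value
  nodenames <- xml_name(nodes)
  values <- xml_text(nodes)
  #make a one-row data frame with the results
  df <- as.data.frame(t(values), stringsAsFactors=FALSE)
  names(df) <- nodenames
  df
})
#make final answer
answer <- bind_rows(dfs)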
Since you have multiple files, you could wrap this script in an outer loop that cycles through the file list. That is a loop within a loop, so performance will suffer if there are many files and many parent nodes in each file.
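A minimal sketch of that outer loop, assuming the files follow the standard urlset/url sitemap layout. The two inline XML strings are made-up stand-ins for real inputs (read_xml() accepts a file path, a URL, or literal XML), and parse_sitemap is a hypothetical helper name. The xml_ns_strip() call is an addition: sitemap files normally declare a default namespace, which would otherwise stop ".//url" from matching.

```r
library(xml2)
library(dplyr)

# Hypothetical stand-ins for real sitemap files; replace with paths or URLs
sitemaps <- c(
  a = '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
         <url><loc>https://example.com/</loc><lastmod>2020-01-01</lastmod></url>
         <url><loc>https://example.com/about</loc></url>
       </urlset>',
  b = '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
         <url><loc>https://example.org/</loc><priority>0.8</priority></url>
       </urlset>'
)

# Hypothetical helper: one row per <url>, one column per child element
parse_sitemap <- function(src) {
  doc <- xml_ns_strip(read_xml(src))       # drop the default namespace
  parents <- xml_find_all(doc, ".//url")
  rows <- lapply(parents, function(node) {
    nodes <- xml_children(node)
    df <- as.data.frame(t(xml_text(nodes)), stringsAsFactors = FALSE)
    names(df) <- xml_name(nodes)
    df
  })
  bind_rows(rows)                          # missing children become NA
}

# Outer loop over the sources; .id records which source each row came from
answer <- bind_rows(lapply(sitemaps, parse_sitemap), .id = "file")
```

Because bind_rows() matches columns by name, url entries with different sets of children can be stacked without padding them by hand.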
Alternative: If there are only a few child nodes and their names are known, it is best to parse them directly and avoid the lapply loop above.
loc <- xml_find_first(parents, ".//loc") %>% xml_text()
lastmod <- xml_find_first(parents, ".//lastmod") %>% xml_text()
changefreq <- xml_find_first(parents, ".//changefreq") %>% xml_text()
priority <- xml_find_first(parents, ".//priority") %>% xml_text()
answer <- data.frame(loc, lastmod, changefreq, priority)
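To illustrate why xml_find_first() suits this direct approach: applied to a node set, it returns exactly one result per parent, and xml_text() yields NA where the child is absent, so the columns stay aligned. The sketch below uses a made-up two-entry sitemap and, as an assumption beyond the code above, strips the default namespace first.

```r
library(xml2)

# Hypothetical two-entry sitemap; the second <url> has no <lastmod>
doc <- read_xml('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2020-01-01</lastmod></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>')
xml_ns_strip(doc)  # modifies doc in place so plain element names match

parents <- xml_find_all(doc, ".//url")

# One value per parent; a missing child becomes NA
loc <- xml_text(xml_find_first(parents, ".//loc"))
lastmod <- xml_text(xml_find_first(parents, ".//lastmod"))

answer <- data.frame(loc, lastmod, stringsAsFactors = FALSE)
```

Using xml_find_all() here instead would return only the matches that exist, so loc and lastmod could end up with different lengths.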
Upvotes: 1