Reputation: 6039
I need to extract certain data from XML that looks like this (simplified for brevity):
<Doc name="Doc1">
<Lists Count="1">
<List Name="List1">
<Points Count="3">
<Point Id="1">
<Tags Count ="1">"a"</Tags>
<Point Position="1" />
</Point>
<Point Id="2">
<Point Position="2" />
</Point>
<Point Id="3">
<Tags Count="1">"c"</Tags>
<Point Position="3" />
</Point>
</Points>
</List>
</Lists>
</Doc>
The output should be a data frame that matches a tag and position to each point Id:
  Point  Tag Position
1     1    a        1
2     2 <NA>        2
3     3    c        3
I am new to XML and have been experimenting with the xml2 package. So far I can extract each variable separately, but because some points have no Tags data, the results have different lengths and I can't find a way to match the three values to each point.
> library(xml2)
> xml_data<-read_xml(...)
> xml_data %>% xml_find_all("//Point") %>% xml_attr("Id")
[1] "1" "2" "3"
> xml_data %>% xml_find_all("//Point/Point") %>% xml_attr("Position")
[1] "1" "2" "3"
> xml_data %>% xml_find_all("//Tags") %>% xml_text()
[1] "\"a\"" "\"c\""
Upvotes: 3
Views: 7271
Reputation: 269556
Run xpathApply over the //Points/Point nodes and, for each such node x, get the Id, the Tag (or NA if none) and the Position:
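library(XML)
# Lines holds the XML text (see the note at the end of this answer)
doc <- xmlTreeParse(Lines, asText = TRUE, useInternalNodes = TRUE)
# Build a one-row data frame per Point node, then stack the rows
do.call("rbind", xpathApply(doc, "//Points/Point", function(x)
  data.frame(Id = as.numeric(xmlAttrs(x)[["Id"]]),
             # c(..., NA)[[1]] falls back to NA when there is no Tags child
             Tags = c(gsub('"', '', xmlValue(x[["Tags"]])), NA)[[1]],
             Position = as.numeric(xmlAttrs(x[["Point"]])[["Position"]]),
             stringsAsFactors = FALSE)))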
giving:
  Id Tags Position
1  1    a        1
2  2 <NA>        2
3  3    c        3
Variation
A variation on the above gives the same answer. At each Point node it builds a string from the attributes and values, then uses read.table to read the strings in:
library(XML)
doc <- xmlTreeParse(Lines, asText = TRUE, useInternalNodes = TRUE)
xp <- xpathSApply(doc, "//Points/Point", function(x) paste(
  xmlAttrs(x)[["Id"]],
  c(gsub('"', '', xmlValue(x[["Tags"]])), NA)[[1]],
  xmlAttrs(x[["Point"]])[["Position"]]))
read.table(text = xp, col.names = c("Id", "Tags", "Position"), as.is = TRUE)
Note: The input Lines is:
Lines <- '<Doc name="Doc1">
<Lists Count="1">
<List Name="List1">
<Points Count="3">
<Point Id="1">
<Tags Count ="1">"a"</Tags>
<Point Position="1" />
</Point>
<Point Id="2">
<Point Position="2" />
</Point>
<Point Id="3">
<Tags Count="1">"c"</Tags>
<Point Position="3" />
</Point>
</Points>
</List>
</Lists>
</Doc>'
Upvotes: 1
Reputation: 78792
purrr and xml2 go well together:
library(xml2)
library(purrr)
txt <- '<Doc name="Doc1">
<Lists Count="1">
<List Name="List1">
<Points Count="3">
<Point Id="1">
<Tags Count ="1">"a"</Tags>
<Point Position="1" />
</Point>
<Point Id="2">
<Point Position="2" />
</Point>
<Point Id="3">
<Tags Count="1">"c"</Tags>
<Point Position="3" />
</Point>
</Points>
</List>
</Lists>
</Doc>'
doc <- read_xml(txt)
xml_find_all(doc, ".//Points/Point") %>%
  map_df(function(x) {
    list(
      Point=xml_attr(x, "Id"),
      Tag=xml_find_first(x, ".//Tags") %>% xml_text() %>% gsub('^"|"$', "", .),
      Position=xml_find_first(x, ".//Point") %>% xml_attr("Position")
    )
  })
## # A tibble: 3 × 3
##   Point Tag   Position
##   <chr> <chr> <chr>
## 1 1     a     1
## 2 2     <NA>  2
## 3 3     c     3
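If you would rather not pull in purrr, the same idea can probably be written with xml2 alone. The sketch below relies on the documented behaviour that xml_find_first(), when given a nodeset, returns a result of the same length with a missing node wherever nothing matches, so the three columns stay aligned. The names points and result are only illustrative, and the child paths ./Tags and ./Point assume the simplified layout from the question.

```r
library(xml2)

# Same sample input as in the question
txt <- '<Doc name="Doc1">
<Lists Count="1">
<List Name="List1">
<Points Count="3">
<Point Id="1">
<Tags Count ="1">"a"</Tags>
<Point Position="1" />
</Point>
<Point Id="2">
<Point Position="2" />
</Point>
<Point Id="3">
<Tags Count="1">"c"</Tags>
<Point Position="3" />
</Point>
</Points>
</List>
</Lists>
</Doc>'

doc <- read_xml(txt)

# Outer Point nodes only (direct children of Points)
points <- xml_find_all(doc, ".//Points/Point")

# xml_find_first() on a nodeset keeps one slot per input node, so a
# Point without a Tags child yields NA instead of being dropped
result <- data.frame(
  Point    = xml_attr(points, "Id"),
  Tag      = gsub('^"|"$', "", xml_text(xml_find_first(points, "./Tags"))),
  Position = xml_attr(xml_find_first(points, "./Point"), "Position"),
  stringsAsFactors = FALSE
)

print(result)
```

All three columns come back as character here; wrap Point and Position in as.numeric() if you need numbers.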
Upvotes: 5