Reputation: 267
I have an XML file of the following format:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<fishmeter>
<mission cruise="2019114" station="344" platform="4174">
<fishstation serialno="7">
<platform>4174</platform>
<nation>58</nation>
<latitudestart>60.746062433333336</latitudestart>
<longitudestart>2.6755209333333334</longitudestart>
<latitudeend>60.75632006666667</latitudeend>
<longitudeend>2.64776135</longitudeend>
<catchsample species="172414" samplenumber="1" noname="makrell" aphia="127023">
<conservation>1</conservation>
<producttype>1</producttype>
<weight>10.195</weight>
<count>0</count>
<lengthsampleweight>0</lengthsampleweight>
<sampleproducttype>1</sampleproducttype>
<lengthmeasurement>E</lengthmeasurement>
<specimensamplecount>36</specimensamplecount>
<individual specimenno="1">
<lengthunit>2</lengthunit>
<length>0.36</length>
<individualproducttype>1</individualproducttype>
</individual>
<individual specimenno="2">
<lengthunit>2</lengthunit>
<length>0.36</length>
<individualproducttype>1</individualproducttype>
</individual>
<individual specimenno="3">
<lengthunit>2</lengthunit>
<length>0.315</length>
<individualproducttype>1</individualproducttype>
</individual>
<individual specimenno="4">
<lengthunit>2</lengthunit>
<length>0.315</length>
<individualproducttype>1</individualproducttype>
</individual>
</catchsample>
<catchsample species="167044" samplenumber="1" noname="knurr" aphia="150637">
<conservation>1</conservation>
<producttype>1</producttype>
<weight>2.52</weight>
<count>0</count>
<lengthsampleweight>0</lengthsampleweight>
<sampleproducttype>1</sampleproducttype>
<lengthmeasurement>E</lengthmeasurement>
<specimensamplecount>10</specimensamplecount>
<individual specimenno="1">
<lengthunit>2</lengthunit>
<length>0.28</length>
<individualproducttype>1</individualproducttype>
</individual>
<individual specimenno="2">
<lengthunit>2</lengthunit>
<length>0.285</length>
<individualproducttype>1</individualproducttype>
</individual>
<individual specimenno="3">
<lengthunit>2</lengthunit>
<length>0.37</length>
<individualproducttype>1</individualproducttype>
</individual>
<individual specimenno="4">
<lengthunit>2</lengthunit>
<length>0.315</length>
<individualproducttype>1</individualproducttype>
</individual>
<individual specimenno="5">
<lengthunit>2</lengthunit>
<length>0.32</length>
<individualproducttype>1</individualproducttype>
</individual>
<individual specimenno="6">
<lengthunit>2</lengthunit>
<length>0.38</length>
<individualproducttype>1</individualproducttype>
</individual>
<individual specimenno="7">
<lengthunit>2</lengthunit>
<length>0.39</length>
<individualproducttype>1</individualproducttype>
</individual>
<individual specimenno="8">
<lengthunit>2</lengthunit>
<length>0.305</length>
<individualproducttype>1</individualproducttype>
</individual>
<individual specimenno="9">
<lengthunit>2</lengthunit>
<length>0.24</length>
<individualproducttype>1</individualproducttype>
</individual>
<individual specimenno="10">
<lengthunit>2</lengthunit>
<length>0.36</length>
<individualproducttype>1</individualproducttype>
</individual>
</catchsample>
</fishstation>
</mission>
</fishmeter>
I am trying to extract the individual nodes as rows of a data frame, while retaining the information from the parent catchsample and grandparent fishstation nodes (plus the mission attributes) in additional columns, so that the resulting data frame has all of the following columns:
cruise, station, platform, serialno, platform, nation, latitudestart, longitudestart, latitudeend, longitudeend, species, samplenumber, noname, aphia, conservation, producttype, weight, count, lengthsampleweight, sampleproducttype, lengthmeasurement, specimensamplecount, specimenno, lengthunit, length, individualproducttype
Following the answer to "R XML - combining parent and child nodes into data frame", I managed to extract the individual node data into a data frame, but not the related information from the higher-level nodes.
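library(xml2)   # read_xml(), xml_find_all(), xml_attrs(), xml_children()
library(purrr)  # %>%, map(), map_dfr(), flatten(), set_names()
library(readr)  # type_convert()

fish <- read_xml('test.xml') %>%
  xml_find_all('//individual') %>%
  map_dfr(~flatten(c(xml_attrs(.x),
                     map(xml_children(.x),
                         ~set_names(as.list(xml_text(.x)), xml_name(.x)))))) %>%
  type_convert()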
# A tibble: 14 x 4
   specimenno lengthunit length individualproducttype
        <dbl>      <dbl>  <dbl>                 <dbl>
 1          1          2  0.36                      1
 2          2          2  0.36                      1
 3          3          2  0.315                     1
 4          4          2  0.315                     1
 5          1          2  0.28                      1
 6          2          2  0.285                     1
 7          3          2  0.37                      1
 8          4          2  0.315                     1
 9          5          2  0.32                      1
10          6          2  0.38                      1
11          7          2  0.39                      1
12          8          2  0.305                     1
13          9          2  0.24                      1
14         10          2  0.36                      1
Upvotes: 0
Views: 442
Reputation: 17729
You could do this:
library(xml2)
library(purrr)
library(readr)
library(rvest)
library(tibble)
individuals <- read_xml('test.xml') %>%
  xml_find_all('//individual')

# Text of all nodes matched by `xpath` (relative to `individual`),
# returned as a character vector named by the node names
to_add <- function(individual, xpath) individual %>%
  html_nodes(xpath = xpath) %>%
  {list(html_text(.), html_name(.))} %>%
  {setNames(object = .[[1]], nm = .[[2]])}

get_data <- function(individual){
  # Attributes of the individual, its catchsample, fishstation and mission
  out <- c(
    individual %>% html_attrs(),
    individual %>% html_nodes(xpath = "..") %>% html_attrs() %>% unlist,
    individual %>% html_nodes(xpath = "../..") %>% html_attrs() %>% unlist,
    individual %>% html_nodes(xpath = "../../..") %>% html_attrs() %>% unlist
  )
  # Leaf children (no nested elements) of fishstation and catchsample,
  # followed by the children of the individual itself
  xpathes <- c("../../*[not(descendant::*)]", "../*[not(descendant::*)]", "*")
  c(sapply(xpathes, to_add, individual = individual, USE.NAMES = FALSE) %>% unlist, out)
}
And then:
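lapply(individuals, get_data) %>%
  do.call(what = rbind) %>%           # rbind, not cbind: one row per individual
  as_tibble(.name_repair = "unique")  # "platform" occurs twice, so the names need repairing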
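If you would rather not depend on rvest, the same idea can be sketched with xml2 alone: for each individual, climb to its catchsample, fishstation and mission with xml_parent() and collect each level's attributes plus the text of its leaf children. This is only a minimal sketch, not a tested drop-in replacement. The helper names node_fields and row_for are made up for illustration, and the inline XML is a shortened stand-in for test.xml so that the snippet runs on its own.

```r
library(xml2)

# Shortened stand-in for the question's file (same nesting, fewer fields).
xml_txt <- '<fishmeter>
  <mission cruise="2019114" station="344" platform="4174">
    <fishstation serialno="7">
      <nation>58</nation>
      <catchsample species="172414" noname="makrell">
        <weight>10.195</weight>
        <individual specimenno="1"><length>0.36</length></individual>
        <individual specimenno="2"><length>0.315</length></individual>
      </catchsample>
      <catchsample species="167044" noname="knurr">
        <weight>2.52</weight>
        <individual specimenno="1"><length>0.28</length></individual>
      </catchsample>
    </fishstation>
  </mission>
</fishmeter>'

doc <- read_xml(xml_txt)

# Hypothetical helper: attributes of `node` plus the text of its leaf
# children (children without nested elements), as one named character vector.
node_fields <- function(node) {
  kids <- xml_find_all(node, "./*[not(*)]")
  c(xml_attrs(node), setNames(xml_text(kids), xml_name(kids)))
}

# Hypothetical helper: build a one-row data frame for a single <individual>,
# ordered mission -> fishstation -> catchsample -> individual.
row_for <- function(ind) {
  cs <- xml_parent(ind)  # catchsample
  fs <- xml_parent(cs)   # fishstation
  ms <- xml_parent(fs)   # mission
  vals <- c(node_fields(ms), node_fields(fs), node_fields(cs), node_fields(ind))
  as.data.frame(as.list(vals), stringsAsFactors = FALSE)
}

fish <- do.call(rbind, lapply(xml_find_all(doc, "//individual"), row_for))

# Everything is read as text; convert numeric-looking columns.
fish <- utils::type.convert(fish, as.is = TRUE)
```

On the full file, "platform" appears both as a mission attribute and as a fishstation child, so expect the second occurrence to be renamed by the data frame's name checking. This sketch also assumes every individual carries the same set of fields; if that does not hold, rbind will fail and something like dplyr::bind_rows would be a better fit.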
Upvotes: 2