Reputation: 437

parse multiple XML files based on a vector and rbind in a dataframe

With some effort and help from Stack Overflow users, I have been able to parse a web page and save it as a data frame. I now want to repeat the same operation on multiple XML files and rbind the results. Here is what worked for a single file:

library(XML)    
xml.url <- "http://www.ebi.ac.uk/ena/data/view/ERS445758&display=xml"
doc <- xmlParse(xml.url)
x <- xmlToDataFrame(getNodeSet(doc,"//SAMPLE_ATTRIBUTE"))
x$UNITS <- NULL 
x_t <- t(x) 
x_t <- as.data.frame(x_t)
names(x_t) <- as.matrix(x_t[1, ])
x_t <- x_t[-1, ]
x_t[] <- lapply(x_t, function(x) type.convert(as.character(x)))

The code above works well. However, when I wrap it in a function to do the same for multiple XML files:

ERS_ID <- c("ERS445758","ERS445759", "ERS445760", "ERS445761", "ERS445762")

xml_url_test =    as.vector(sprintf("http://www.ebi.ac.uk/ena/data/view/ERS445758&display=xml",
              ERS_ID))

XML_parser <- function(XML_url){
doc <- xmlParse(XML_url)
x <- xmlToDataFrame(getNodeSet(doc,"//SAMPLE_ATTRIBUTE"))
x$UNITS <- NULL 
x_t <- t(x)
x_t <- as.data.frame(x_t)
names(x_t) <- as.matrix(x_t[1, ])
x_t <- x_t[-1, ]
x_t[] <- lapply(x_t, function(x) type.convert(as.character(x)))
return(x_t)
}

major_test <- sapply(xml_url_test, XML_parser)

It works, but gives me a long list that is not in the data frame format I got for the single XML file. Finally, I would also like to add a column to the final data frame that holds the ERS number from the ERS_ID vector, something like x_t$ERSid <- ERS_ID inside the function.

Can someone point out what I am missing in the function, and suggest any better ways to do the task?

Thanks!

Upvotes: 1

Answers (4)

Chris S.

Reputation: 2225

You can retrieve multiple IDs with a single REST URL using a comma-separated list or a range like ERS445758-ERS445762, which avoids sending multiple queries to the ENA.

This code gets all 5 samples into a node set and then applies functions using a leading dot in the XPath string so that it's relative to each sample node.

library(XML)
ERS_ID <- c("ERS445758","ERS445759", "ERS445760", "ERS445761", "ERS445762")
url <- paste0( "http://www.ebi.ac.uk/ena/data/view/", paste(ERS_ID, collapse=","), "&display=xml")  
doc <- xmlParse(url)
samples <- getNodeSet( doc, "//SAMPLE")
## check the first node
samples[[1]]
## get the sample attribute node set and apply xmlToDataFrame to that  
x <- lapply( lapply(samples, getNodeSet,  ".//SAMPLE_ATTRIBUTE"), xmlToDataFrame)
# labels for bind_rows
names(x) <- sapply(samples, xpathSApply, ".//PRIMARY_ID", xmlValue)  
library(dplyr)
y <- bind_rows(x, .id="sample")

z <- subset(y, TAG %in% c("age","sex","body site","body-mass index") , 1:3)
       sample             TAG         VALUE
15  ERS445758             age            28
16  ERS445758             sex          male
17  ERS445758       body site Sigmoid colon
19  ERS445758 body-mass index    16.9550173
50  ERS445759             age            58
51  ERS445759             sex          male
...

library(tidyr)
z %>% spread( TAG, VALUE)
     sample age     body site body-mass index    sex
1 ERS445758  28 Sigmoid colon      16.9550173   male
2 ERS445759  58 Sigmoid colon     23.22543185   male
3 ERS445760  26 Sigmoid colon     20.76124567 female
4 ERS445761  30 Sigmoid colon               0   male
5 ERS445762  36 Sigmoid colon               0   male
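
The effect of the leading dot can be tried offline. The following is only a minimal sketch: toy_xml is a made-up document that imitates the SAMPLE / SAMPLE_ATTRIBUTE layout, not real ENA output. Each sample node is queried with a path relative to itself, so the attributes are counted per sample.

```r
library(XML)

# Hypothetical stand-in for the ENA response: two samples with 2 and 1 attributes
toy_xml <- paste0(
  "<SET>",
  "<SAMPLE><PRIMARY_ID>S1</PRIMARY_ID>",
  "<SAMPLE_ATTRIBUTE><TAG>age</TAG><VALUE>28</VALUE></SAMPLE_ATTRIBUTE>",
  "<SAMPLE_ATTRIBUTE><TAG>sex</TAG><VALUE>male</VALUE></SAMPLE_ATTRIBUTE>",
  "</SAMPLE>",
  "<SAMPLE><PRIMARY_ID>S2</PRIMARY_ID>",
  "<SAMPLE_ATTRIBUTE><TAG>age</TAG><VALUE>58</VALUE></SAMPLE_ATTRIBUTE>",
  "</SAMPLE>",
  "</SET>"
)

doc <- xmlParse(toy_xml, asText = TRUE)
samples <- getNodeSet(doc, "//SAMPLE")

# "./" and ".//" start the search at the current sample node
ids <- sapply(samples, function(node) xmlValue(getNodeSet(node, "./PRIMARY_ID")[[1]]))
n_attr <- sapply(samples, function(node) length(getNodeSet(node, ".//SAMPLE_ATTRIBUTE")))
```

Here ids holds one identifier per sample and n_attr the number of attributes found under each one.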

Upvotes: 0

alistaire

Reputation: 43334

purrr is really helpful here, as you can iterate over a vector of URLs or a list of XML files with map, or within nested elements with at_depth (renamed modify_depth in later purrr releases), and simplify the results with the *_df forms and flatten.

library(tidyverse)
library(xml2)

# be kind, don't call this more times than you need to
x <- c("ERS445758","ERS445759", "ERS445760", "ERS445761", "ERS445762") %>% 
    sprintf("http://www.ebi.ac.uk/ena/data/view/%s&display=xml", .) %>% 
    map(read_xml)    # read each URL into a list item

df <- x %>% map(xml_find_all, '//SAMPLE_ATTRIBUTE') %>%     # for each item select nodes
    at_depth(2, as_list) %>%     # convert each (nested) attribute to list
    map_df(map_df, flatten)    # flatten items, collect pages to df, then all to one df

df
## # A tibble: 175 × 3
##                       TAG                               VALUE UNITS
##                     <chr>                               <chr> <chr>
## 1      investigation type                          metagenome  <NA>
## 2            project name                                BMRP  <NA>
## 3     experimental factor                          microbiome  <NA>
## 4             target gene                            16S rRNA  <NA>
## 5      target subfragment                                V1V2  <NA>
## 6             pcr primers                            27F-338R  <NA>
## 7   multiplex identifiers                          TGATACGTCT  <NA>
## 8       sequencing method                      pyrosequencing  <NA>
## 9  sequence quality check                            software  <NA>
## 10          chimera check ChimeraSlayer; Usearch 4.1 database  <NA>
## # ... with 165 more rows

Upvotes: 1

Parfait

Reputation: 107567

Your main issue is using sapply() instead of lapply(): lapply() always returns a list, whereas sapply() attempts to simplify the result to a vector or matrix, here a matrix.

major_test <- lapply(xml_url_test, XML_parser)

Of course, sapply() is a wrapper around lapply() and can also return a list when called with simplify=FALSE:

major_test <- sapply(xml_url_test, XML_parser, simplify=FALSE)
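
The difference can be seen without any network access. This is a minimal sketch in base R; make_df and ids are made-up stand-ins for XML_parser and the URL vector, so only the shape of the results carries over.

```r
# Hypothetical stand-in for XML_parser: returns a one-row data frame
make_df <- function(id) {
  data.frame(id = id, value = nchar(id), stringsAsFactors = FALSE)
}

ids <- c("a", "bb", "ccc")

# sapply() simplifies the equally sized results into a matrix of mode list
simplified <- sapply(ids, make_df)

# lapply() keeps one data frame per element
as_list <- lapply(ids, make_df)

# a list of data frames can then be stacked row-wise
combined <- do.call(rbind, as_list)
```

simplified is a 2 x 3 matrix whose cells are the individual columns, which is awkward to work with, while combined is an ordinary three-row data frame.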

However, a few other items came up:

At the beginning, you are not inserting your ERS_ID values into the URL stem, because the sprintf format string has no %s placeholder. So right now, the same URL is repeated for every ID.
At the end, you are not binding your list of data frames into a single final data frame.
Add the new ERS column inside your function by passing in each element of the ERS_ID vector. While creating the column, you can also remove the ERS prefix with gsub.
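
The first and third points can be checked in isolation. This is a small base R sketch that only builds strings and does not contact the ENA; the variable names repeated, urls and numeric_part are illustrative.

```r
ERS_ID <- c("ERS445758", "ERS445759")

# No %s in the format: the same URL is recycled once per ID
# (newer R versions may warn about the unused argument, hence suppressWarnings)
repeated <- suppressWarnings(
  sprintf("http://www.ebi.ac.uk/ena/data/view/ERS445758&display=xml", ERS_ID)
)

# With %s, sprintf is vectorised and builds one URL per ID
urls <- sprintf("http://www.ebi.ac.uk/ena/data/view/%s&display=xml", ERS_ID)

# gsub strips the ERS prefix, leaving the numeric part as a character string
numeric_part <- gsub("ERS", "", ERS_ID)
```

repeated contains two identical URLs, while urls differs per ID.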

R code (adjusted)

XML_parser <- function(eid) {
  XML_url <- as.vector(sprintf("http://www.ebi.ac.uk/ena/data/view/%s&display=xml", eid))
  doc <- xmlParse(XML_url)
  x <- xmlToDataFrame(getNodeSet(doc,"//SAMPLE_ATTRIBUTE"))
  x$UNITS <- NULL 
  x_t <- t(x)
  x_t <- as.data.frame(x_t)
  names(x_t) <- as.matrix(x_t[1, ])
  x_t <- x_t[-1, ]
  x_t[] <- lapply(x_t, function(x) type.convert(as.character(x)))
  x_t$ERSid <- gsub("ERS", "", eid)         # ADD COL, REMOVE ERS
  x_t <- x_t[, c(ncol(x_t), seq_len(ncol(x_t) - 1))]   # MOVE NEW COL TO FIRST
  return(x_t)
}

major_test <- lapply(ERS_ID, XML_parser)
# major_test <- sapply(ERS_ID, XML_parser, simplify=FALSE)

# BIND DATA FRAMES TOGETHER
finaldf <- do.call(rbind, major_test)
# RESET ROW NAMES
row.names(finaldf) <- seq(nrow(finaldf))

Upvotes: 3

Rentrop

Reputation: 21497

Using xml2 and the tidyverse you can do something like this:

require(xml2)
require(purrr)
require(tidyr)
urls <- rep("http://www.ebi.ac.uk/ena/data/view/ERS445758&display=xml", 2)
identifier <- LETTERS[seq_along(urls)] # Take a unique identifier per url here

parse_attribute <- function(x){
  out <- data.frame(tag = xml_text(xml_find_all(x, "./TAG")),
             value = xml_text(xml_find_all(x, "./VALUE")), stringsAsFactors = FALSE)
  spread(out, tag, value)
}

doc <- map(urls, read_xml)

out <- doc %>% 
  map(xml_find_all, "//SAMPLE_ATTRIBUTE") %>% 
  set_names(identifier) %>%
  map_df(parse_attribute, .id="url")

This gives you a 2x36 data.frame. To parse the column types, I would suggest using readr::type_convert(out).

out looks as follows:

  url age body product     body site body-mass index                       chimera check collection date
1   A  28       mucosa Sigmoid colon        16.95502 ChimeraSlayer; Usearch 4.1 database      2009-03-16
2   B  28       mucosa Sigmoid colon        16.95502 ChimeraSlayer; Usearch 4.1 database      2009-03-16
  disease status ENA-BASE-COUNT ENA-CHECKLIST ENA-FIRST-PUBLIC ENA-LAST-UPDATE ENA-SPOT-COUNT
1      remission         627051     ERC000015       2014-12-31      2016-10-21           1668
2      remission         627051     ERC000015       2014-12-31      2016-10-21           1668
          environment (biome)       environment (feature) environment (material) experimental factor
1 organism-associated habitat organism-associated habitat                  mucus          microbiome
2 organism-associated habitat organism-associated habitat                  mucus          microbiome
  gastrointestinal tract disorder geographic location (country and/or sea,region) geographic location (latitude)
1              Ulcerative Colitis                                           India                       72.82807
2              Ulcerative Colitis                                           India                       72.82807
  geographic location (longitude) host subject id human gut environmental package investigation type
1                        18.94084               1                       human-gut         metagenome
2                        18.94084               1                       human-gut         metagenome
                           medication multiplex identifiers pcr primers    phenotype project name
1 ASA;Steroids;Probiotics;Antibiotics            TGATACGTCT    27F-338R pathological         BMRP
2 ASA;Steroids;Probiotics;Antibiotics            TGATACGTCT    27F-338R pathological         BMRP
  sample collection device or method sequence quality check sequencing method sequencing template  sex target gene
1                             biopsy               software    pyrosequencing                 DNA male    16S rRNA
2                             biopsy               software    pyrosequencing                 DNA male    16S rRNA
  target subfragment
1               V1V2
2               V1V2
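
For the type-conversion step, here is a minimal base R sketch of the same idea. Note that it swaps readr::type_convert for utils::type.convert applied column by column, and that toy_out is a made-up three-column excerpt rather than the real 36-column result.

```r
# Hypothetical excerpt of the parsed result: every column starts as character
toy_out <- data.frame(
  url = c("A", "B"),
  age = c("28", "58"),
  `body-mass index` = c("16.95502", "23.22543"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Convert each column to the most specific type that fits;
# as.is = TRUE keeps non-numeric text as character instead of factor
converted <- toy_out
converted[] <- lapply(toy_out, type.convert, as.is = TRUE)
```

After this, age is an integer column, body-mass index is numeric, and url remains character.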

Upvotes: 1
