Reputation: 11

Extracting a list from PubMed XML in R and adding to data frame/tibble

I am trying to extract a few bits of information from XML using R and then put them into a data frame to export as a csv. The XML is coming from PubMed records and I'm using the rentrez package to query. There will be thousands of records, though I'm only working with two while I try to figure it out.

For each PubMed record, I want to extract the PMID, indexing method, last-revised date, and a list of all the MeSH terms (ideally the terms would be a single string in one "cell"). My current code seems to combine the MeSH terms from every record and put the same combined list in each row.

This is the code I have so far (I am very new at this, so it may be a bit frankenstein-y).

library(xml2) 
library(XML)
library(data.table)
library(rentrez)
library(dplyr) 

#get xml (currently, just two articles)
testa <- entrez_search(db="pubmed", term="38863400 OR 32029379", use_history=TRUE)
testa_rec <- entrez_fetch(db="pubmed", id=testa$ids, rettype="xml", parsed=TRUE)

# Function to extract the required information
extract_info <- function(articles_nodes) { 
indexing <- xpathApply(testa_rec, "//MedlineCitation", xmlGetAttr, "IndexingMethod")
date <- xpathApply(testa_rec, "//DateRevised", xmlValue)
pmid <- xpathApply(testa_rec, "//PMID", xmlValue)
mesh <- xpathSApply(testa_rec, "//MeshHeading", xmlValue)

#check for null (though this might be unnecessary)
if (is.null(pmid)) pmid <- NA 
if (is.null(indexing)) indexing <- NA 
if (is.null(date)) date <- NA
if (is.null(mesh)) mesh <- NA

#add to tibble
tibble(
pmid = pmid, 
indexing = indexing,
date = date, 
mesh = paste(mesh, collapse = ";")
)
}

#get set of nodes
articles_nodes <- getNodeSet(testa_rec, "//MedlineCitation")

# Apply the extract_info function to each MedlineCitation node
results_list <- lapply(articles_nodes, extract_info)

# Bind the results into a single tibble
final_results <- bind_rows(results_list)

#get csv of results (only method that seemed to work)
fwrite(final_results, file ="myDT.csv")

The result I get has four rows and four columns instead of just two rows, with the two PMIDs repeated for some reason. The information is correct, but the list of MeSH terms includes terms from both PMIDs instead of each list being in its own row. I'm assuming this is a logic problem related to how the function output ends up in results_list, but I'm at a loss for what it could be.

This is the csv output:

pmid,indexing,date,mesh
38863400,Automated,20240615,"Humans;Ethiopiaepidemiology;Female;Adult;Adolescent;Obesityepidemiology;Middle Aged;Overweightepidemiology;Young Adult;Prevalence;Health Surveys;Socioeconomic Factors;Risk Factors;Age Factors;Abortion, Inducedmethodstrends;Female;Gestational Age;Humans;Pregnancy;Women's Health"
32029379,Curated,20200306,"Humans;Ethiopiaepidemiology;Female;Adult;Adolescent;Obesityepidemiology;Middle Aged;Overweightepidemiology;Young Adult;Prevalence;Health Surveys;Socioeconomic Factors;Risk Factors;Age Factors;Abortion, Inducedmethodstrends;Female;Gestational Age;Humans;Pregnancy;Women's Health"
38863400,Automated,20240615,"Humans;Ethiopiaepidemiology;Female;Adult;Adolescent;Obesityepidemiology;Middle Aged;Overweightepidemiology;Young Adult;Prevalence;Health Surveys;Socioeconomic Factors;Risk Factors;Age Factors;Abortion, Inducedmethodstrends;Female;Gestational Age;Humans;Pregnancy;Women's Health"
32029379,Curated,20200306,"Humans;Ethiopiaepidemiology;Female;Adult;Adolescent;Obesityepidemiology;Middle Aged;Overweightepidemiology;Young Adult;Prevalence;Health Surveys;Socioeconomic Factors;Risk Factors;Age Factors;Abortion, Inducedmethodstrends;Female;Gestational Age;Humans;Pregnancy;Women's Health"

Upvotes: 1

Answers (3)

G. Grothendieck

Reputation: 269905

Try the puremoe package. (Other packages that work with pubmed are RefManageR, RISmed, pubmed.miner, easyPubMed and pubmedR.)

library(puremoe)
out <- get_records(c(38863400, 32029379), "pubmed_abstracts")

str(out)

giving

Classes ‘data.table’ and 'data.frame':  2 obs. of  6 variables:
 $ pmid        : chr  "38863400" "32029379"
 $ year        : chr  "2024" "2020"
 $ journal     : chr  "Global health action" "Best practice & research. Clinical obstetrics & gynaecology"
 $ articletitle: chr  "Overweight and obesity trends and associated factors among reproductive women in Ethiopia." "Modern methods to induce abortion: Safety, efficacy and choice."
 $ abstract    : chr  "In low- and middle-income countries, the double burden of malnutrition is prevalent. Many countries in Africa a"| __truncated__ "Abortion care is a fundamental part of women's reproductive health care. Surgical and medical abortion methods "| __truncated__
 $ annotations :List of 2
  ..$ :'data.frame':    20 obs. of  3 variables:
  .. ..$ pmid: chr  "38863400" "38863400" "38863400" "38863400" ...
  .. ..$ type: chr  "MeSH" "MeSH" "MeSH" "MeSH" ...
  .. ..$ form: chr  "Humans" "Ethiopia" "Female" "Adult" ...
  ..$ :'data.frame':    8 obs. of  3 variables:
  .. ..$ pmid: chr  "32029379" "32029379" "32029379" "32029379" ...
  .. ..$ type: chr  "MeSH" "MeSH" "MeSH" "MeSH" ...
  .. ..$ form: chr  "Abortion, Induced" "Female" "Gestational Age" "Humans" ...
 - attr(*, ".internal.selfref")=<externalptr>
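
To get the MeSH terms into one string per record, as the question asks, the annotations list column could be collapsed along these lines. This is only a sketch: the toy list below stands in for out$annotations and copies the column names (pmid, type, form) from the str() output above, and collapse_mesh is a hypothetical helper name, not part of puremoe.

```r
# Toy stand-in for out$annotations: one data frame per record,
# with the same columns that str(out) reports (pmid, type, form).
annotations <- list(
  data.frame(pmid = "38863400", type = "MeSH",
             form = c("Humans", "Ethiopia", "Female"),
             stringsAsFactors = FALSE),
  data.frame(pmid = "32029379", type = "MeSH",
             form = c("Abortion, Induced", "Female"),
             stringsAsFactors = FALSE)
)

# Hypothetical helper: join the MeSH rows of one data frame into a single string.
collapse_mesh <- function(a) paste(a$form[a$type == "MeSH"], collapse = "; ")

# One string per record, in the same order as the list.
mesh <- vapply(annotations, collapse_mesh, character(1))

flat <- data.frame(
  pmid = vapply(annotations, function(a) a$pmid[1], character(1)),
  mesh = mesh,
  stringsAsFactors = FALSE
)
flat
```

With the real object, the same helper would presumably be applied to out$annotations, and the list column dropped before writing the csv.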

Upvotes: 1

the-mad-statter

Reputation: 8861

The duplication was happening because you ended up mixing and matching vectorized and non-vectorized code.

Your extract_info() function was coded to read from testa_rec as opposed to the value that got passed to the function in the argument articles_nodes.

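Because most, if not all, of the functions in extract_info() are vectorized, the function could already operate on all of testa_rec as is. Each call therefore returned a two-row tibble covering both records, and your lapply() called it once per node, i.e. twice, which is where the four rows came from.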
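
The difference is easy to see on a toy document (made-up XML, not real PubMed output). In the XML package, a path starting with // is absolute and searches from the document root even when a single node is passed in, while a path starting with ./ is relative to that node:

```r
library(XML)

# Toy document with two records (hypothetical data, not real PubMed output).
doc <- xmlParse(
  "<PubmedArticleSet>
     <PubmedArticle><MedlineCitation><PMID>111</PMID></MedlineCitation></PubmedArticle>
     <PubmedArticle><MedlineCitation><PMID>222</PMID></MedlineCitation></PubmedArticle>
   </PubmedArticleSet>",
  asText = TRUE
)

nodes <- getNodeSet(doc, "//MedlineCitation")

# "//PMID" is an absolute path: it starts at the document root,
# even though only the first node is passed in.
absolute <- xpathSApply(nodes[[1]], "//PMID", xmlValue)

# "./PMID" is relative: only direct children of the node that was passed in.
relative <- xpathSApply(nodes[[1]], "./PMID", xmlValue)

length(absolute)  # PMIDs of both records
relative          # PMID of the first record only
```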

Here is a version that reads each article node individually, as I think you intended. Because each node is now read on its own, the XPath expressions had to change from document-wide paths (//...) to paths relative to the node (./...). You may need to confirm that these new paths work as intended across all the articles you plan to query.

library(dplyr)
library(rentrez)
library(XML)

#get xml (currently, just two articles)
testa <- entrez_search(db="pubmed", term="38863400 OR 32029379", use_history=TRUE)
testa_rec <- entrez_fetch(db="pubmed", id=testa$ids, rettype="xml", parsed=TRUE)

# Function to extract the required information
extract_info <- function(article_node) { 
  indexing <- xpathApply(article_node, ".", xmlGetAttr, "IndexingMethod")[[1]]
  date <- xpathApply(article_node, "./DateRevised", xmlValue)[[1]]
  pmid <- xpathApply(article_node, "./PMID", xmlValue)[[1]]
  mesh <- xpathSApply(article_node, "./MeshHeadingList/MeshHeading", xmlValue)

  #check for null (though this might be unnecessary)
  if (is.null(pmid)) pmid <- NA 
  if (is.null(indexing)) indexing <- NA 
  if (is.null(date)) date <- NA
  if (is.null(mesh)) mesh <- NA

  #add to tibble
  tibble(
    pmid = pmid, 
    indexing = indexing,
    date = date, 
    mesh = paste(mesh, collapse = ";")
  )
}

#get set of nodes
articles_nodes <- getNodeSet(testa_rec, "//MedlineCitation")

# Apply the extract_info function to each MedlineCitation node
results_list <- lapply(articles_nodes, extract_info)

# Bind the results into a single tibble
final_results <- bind_rows(results_list)

final_results
#> # A tibble: 2 × 4
#>   pmid     indexing  date     mesh                                              
#>   <chr>    <chr>     <chr>    <chr>                                             
#> 1 38863400 Automated 20240615 Humans;Ethiopiaepidemiology;Female;Adult;Adolesce…
#> 2 32029379 Curated   20200306 Abortion, Inducedmethodstrends;Female;Gestational…

Created on 2024-07-02 with reprex v2.1.0.9000
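
Two things may still be worth tightening before running this over thousands of records. First, xmlValue() on a MeshHeading concatenates the text of all its children, which is why a descriptor and its qualifiers run together ("Ethiopiaepidemiology"). Second, [[1]] on an empty result throws an error before the is.null() checks are ever reached. A possible way around both, as a sketch on made-up XML: first_or_na and format_heading are hypothetical helpers, and the "/" separator is just one choice.

```r
library(XML)

# Toy record (hypothetical data): one qualified heading, one plain heading,
# and deliberately no DateRevised element.
doc <- xmlParse(
  "<MedlineCitation>
     <PMID>111</PMID>
     <MeshHeadingList>
       <MeshHeading>
         <DescriptorName>Ethiopia</DescriptorName>
         <QualifierName>epidemiology</QualifierName>
       </MeshHeading>
       <MeshHeading><DescriptorName>Female</DescriptorName></MeshHeading>
     </MeshHeadingList>
   </MedlineCitation>",
  asText = TRUE
)
node <- xmlRoot(doc)

# Hypothetical helper: first match of an XPath, or NA when nothing matches.
first_or_na <- function(node, path) {
  hits <- xpathSApply(node, path, xmlValue)
  if (length(hits) == 0) NA_character_ else hits[[1]]
}

# Hypothetical helper: "Descriptor/qualifier1/qualifier2" for one MeshHeading.
format_heading <- function(heading) {
  descriptor <- unlist(xpathSApply(heading, "./DescriptorName", xmlValue))
  qualifiers <- unlist(xpathSApply(heading, "./QualifierName", xmlValue))
  paste(c(descriptor, qualifiers), collapse = "/")
}

headings <- getNodeSet(node, "./MeshHeadingList/MeshHeading")
mesh <- paste(vapply(headings, format_heading, character(1)), collapse = ";")
date <- first_or_na(node, "./DateRevised")  # missing in the toy record
pmid <- first_or_na(node, "./PMID")
```

The same two helpers could replace the [[1]] indexing and the single xmlValue() call on MeshHeading inside extract_info().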


Upvotes: 0

Till

Reputation: 6663

Here is a version of extract_info() that builds on yours and extracts info for one or more entries. I use xmlToList() to get the mesh headings out.

library(XML)
library(rentrez)
library(tibble)

testa <- entrez_search(db = "pubmed", term = "38863400 OR 32029379", use_history = TRUE)
testa_rec <- entrez_fetch(db = "pubmed", id = testa$ids, rettype = "xml", parsed = TRUE)

# Function to extract the required information
extract_info <- function(articles_nodes) {
  indexing <- xpathApply(articles_nodes, "//MedlineCitation", xmlGetAttr, "IndexingMethod")
  date <- xpathApply(articles_nodes, "//DateRevised", xmlValue)
  pmid <- xpathApply(articles_nodes, "//PMID", xmlValue)
  # One string of descriptor names per article
  mesh <- lapply(xmlToList(articles_nodes), \(x) {
    descriptors <- sapply(x$MedlineCitation$MeshHeadingList, \(y) y$DescriptorName$text)
    paste(descriptors, collapse = "; ")
  })

  tibble(
    pmid = unlist(pmid),
    indexing = unlist(indexing),
    date = unlist(date),
    mesh = unlist(mesh)
  )
}

extract_info(testa_rec)
#> # A tibble: 2 × 4
#>   pmid     indexing  date     mesh                                              
#>   <chr>    <chr>     <chr>    <chr>                                             
#> 1 38863400 Automated 20240615 Humans; Ethiopia; Female; Adult; Adolescent; Obes…
#> 2 32029379 Curated   20200306 Abortion, Induced; Female; Gestational Age; Human…
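
One caveat when scaling this up: "//PMID" matches every PMID element in the document, not only the one directly under MedlineCitation. The sketch below assumes that some records carry additional PMID elements nested deeper (for example under CommentsCorrections); in that case the columns would end up with different lengths and tibble() would fail. The XML is made up to illustrate the point, and "//MedlineCitation/PMID" is one possible way to scope the path.

```r
library(XML)

# Toy document (hypothetical data): the first record also contains a
# nested PMID that refers to a different article.
doc <- xmlParse(
  "<PubmedArticleSet>
     <PubmedArticle><MedlineCitation>
       <PMID>111</PMID>
       <CommentsCorrectionsList>
         <CommentsCorrections><PMID>999</PMID></CommentsCorrections>
       </CommentsCorrectionsList>
     </MedlineCitation></PubmedArticle>
     <PubmedArticle><MedlineCitation><PMID>222</PMID></MedlineCitation></PubmedArticle>
   </PubmedArticleSet>",
  asText = TRUE
)

# Every PMID element anywhere in the document, nested ones included.
all_pmids <- xpathSApply(doc, "//PMID", xmlValue)

# Only PMID elements that are direct children of MedlineCitation.
record_pmids <- xpathSApply(doc, "//MedlineCitation/PMID", xmlValue)

length(all_pmids)
record_pmids
```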

Upvotes: 1
