mickmars51

Reputation: 1

Extracting author and affiliation from xml file retrieved using rentrez

I was following the code in this post: https://quantixed.org/2021/04/04/ten-years-vs-the-spread-ii-calculating-publication-lag-times-in-r/ and was amazed that it can output the received, accepted and published dates and the gaps between them. Is there a way to get any of the following?

- number of authors (to be fair, I could write a counter for separators for this one)
- first author affiliation
- last author affiliation
- number of citations per article
- degree of the first author

Or is there a way to see the full list of what can be pulled? What I have tried so far:

To grab the first and last authors, once the data frame listed all authors, this sufficed:

theData$authLast <- sapply(strsplit(theData$authors, "|", fixed=TRUE), tail, 1)
theData$authFirst <- sapply(strsplit(theData$authors, "|", fixed=TRUE), head, 1)

However, when trying to get author affiliations, the following gives me all affiliations:

authAffil <- lapply(records, xpathSApply, ".//Author/AffiliationInfo", xmlValue)
authAffil[sapply(authAffil, is.list)] <- NA
authAffil <- sapply(authAffil, paste, collapse = "|")

Any direction on how to get the first author, their affiliation, the last author and their affiliation into four columns of the data frame, or on the other metrics listed above, would be helpful. Thank you!

Edit: I tried to make a reprex; let me know if this counts as a minimal reproducible example. Thank you for the suggestion, Ric Villalba!

#load in packages
library(reprex)
library(devtools)
#> Loading required package: usethis
install_github("ropensci/rentrez")
#> Skipping install of 'rentrez' from a github remote, the SHA1 (a225f213) has not changed since last install.
#>   Use `force = TRUE` to force installation
library(rentrez)
require(XML)
#> Loading required package: XML
require(ggplot2)
#> Loading required package: ggplot2
require(ggridges)
#> Loading required package: ggridges
require(gridExtra)
#> Loading required package: gridExtra
# search pubmed using a search term (use_history allows retrieval of all records)
pp <- entrez_search(db="pubmed", term="cell[ta] AND 2010 : 2021[pdat] AND (journal article[pt] NOT review[pt] NOT comment[pt]
                    NOT autobiography[pt] NOT biography[pt] NOT case reports[pt] NOT clinical trial[pt]
                    NOT historical article[pt] NOT comparative study[pt] NOT evaluation study[pt]
                    NOT evaluation study[pt] NOT introductory journal article[pt])", use_history = TRUE)
pp_rec <- entrez_fetch(db="pubmed", web_history=pp$web_history, rettype="xml", parsed=TRUE)
# save records as XML file
saveXML(pp_rec, file = "Data/records.xml")
#> Error in saveXML(pp_rec, file = "Data/records.xml"): cannot create file Data/records.xml. Check the directory exists and permissions are appropriate
filename <- "~/Data/records.xml"
## extract a data frame from XML file
## This is modified from christopherBelter's pubmedXML R code
extract_xml <- function(theFile) {
  library(XML)
  newData <- xmlParse(theFile)
  records <- getNodeSet(newData, "//PubmedArticle")
  pmid <- xpathSApply(newData,"//MedlineCitation/PMID", xmlValue)
  doi <- lapply(records, xpathSApply, ".//ELocationID[@EIdType = \"doi\"]", xmlValue)
  doi[sapply(doi, is.list)] <- NA
  doi <- unlist(doi)
  authLast <- lapply(records, xpathSApply, ".//Author/LastName", xmlValue)
  authLast[sapply(authLast, is.list)] <- NA
  authInit <- lapply(records, xpathSApply, ".//Author/Initials", xmlValue)
  authInit[sapply(authInit, is.list)] <- NA
  authors <- mapply(paste, authLast, authInit, collapse = "|")
  authAffil <- lapply(records, xpathSApply, ".//Author/AffiliationInfo", xmlValue)
  authAffil[sapply(authAffil, is.list)] <- NA
  authAffil <- sapply(authAffil, paste, collapse = "|")
  theDF <- data.frame(pmid, doi, authors, authAffil, stringsAsFactors = FALSE)
  
  return(theDF)
}
#extract into a dataframe
theData <- extract_xml(filename)
#show the author affiliations as bunched
print(theData$authAffil[1])
#> [1] "Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA. Electronic address: [email protected].|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA; Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA; Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA; Department of Immunology and Infectious Diseases, Harvard T.H. Chan School of Public Health, Harvard University, Boston, MA 02115, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA; Division of Infectious Diseases, Massachusetts General Hospital, Boston, MA 02114, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA; Faculty of Arts and Sciences, Harvard University, Cambridge, MA 02138, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA; Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA; Department of Systems Biology, Harvard Medical School, Boston, MA 02115, USA.|Department of Epidemiology, Harvard T.H. Chan School of Public Health, Harvard University, Boston, MA 02115, USA; Center for Communicable Disease Dynamics, Department of Epidemiology, Harvard T. H. Chan School of Public Health, Harvard University, Boston, MA 02115, USA.|Department of Epidemiology, Harvard T.H. Chan School of Public Health, Harvard University, Boston, MA 02115, USA; Center for Communicable Disease Dynamics, Department of Epidemiology, Harvard T. H. 
Chan School of Public Health, Harvard University, Boston, MA 02115, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA; Applied Epidemiology Fellowship, Council of State and Territorial Epidemiologists, Atlanta, GA 30345, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute 
of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Barnstable County Department of Health and the Environment, Barnstable, MA 02630, USA.|Barnstable County Department of Health and the Environment, Barnstable, MA 02630, USA.|Barnstable County Department of Health and the Environment, Barnstable, MA 02630, USA.|Barnstable County Department of Human Services, Barnstable, MA 02630, USA.|Community Tracing Collaborative, Commonwealth of Massachusetts, Boston, MA 02199, USA.|Community Tracing Collaborative, Commonwealth of Massachusetts, Boston, MA 02199, USA.|Community Tracing Collaborative, Commonwealth of Massachusetts, Boston, MA 02199, USA.|Department of Epidemiology, Harvard T.H. Chan School of Public Health, Harvard University, Boston, MA 02115, USA; Center for Communicable Disease Dynamics, Department of Epidemiology, Harvard T. H. 
Chan School of Public Health, Harvard University, Boston, MA 02115, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA; Department of Immunology and Infectious Diseases, Harvard T.H. Chan School of Public Health, Harvard University, Boston, MA 02115, USA; Massachusetts Consortium for Pathogen Readiness, Boston, MA 02115, USA. Electronic address: [email protected].|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA; Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA; Department of Immunology and Infectious Diseases, Harvard T.H. Chan School of Public Health, Harvard University, Boston, MA 02115, USA; Howard Hughes Medical Institute, Chevy Chase, MD 20815, USA; Massachusetts Consortium for Pathogen Readiness, Boston, MA 02115, USA."

Created on 2022-11-05 with reprex v2.0.2

Upvotes: 0

Views: 378

Answers (1)

quantixed

Reputation: 352

In the code that you posted, the extract_xml() function pulls information out of a large XML file retrieved using rentrez. Using the logic in your question, you can get four columns (first author, first author affiliation, last author, last author affiliation) like this:

theData$authFirst <- sapply(strsplit(theData$authors, "|", fixed=TRUE), head, 1)
theData$affilFirst <- sapply(strsplit(theData$authAffil, "|", fixed=TRUE), head, 1)
theData$authLast <- sapply(strsplit(theData$authors, "|", fixed=TRUE), tail, 1) 
theData$affilLast <- sapply(strsplit(theData$authAffil, "|", fixed=TRUE), tail, 1)

This appends four columns to theData, the data frame created in your reprex.
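
One caveat: splitting authAffil on "|" assumes that every author has exactly one affiliation entry. If an author has no AffiliationInfo node, or several, the first and last pieces of that string may not belong to the first and last authors. A minimal sketch of a per-author alternative, assuming the records follow the usual AuthorList/Author/AffiliationInfo/Affiliation layout, is to address the first and last Author nodes directly with XPath. The sample XML, the grab() helper and the output column names below are made up for illustration; with real data, records would come from getNodeSet(newData, "//PubmedArticle") as in extract_xml().

```r
library(XML)

# Hypothetical two-record sample that mimics the PubMed layout (not real data)
xml_txt <- '<PubmedArticleSet>
  <PubmedArticle><MedlineCitation><PMID>1</PMID><Article><AuthorList>
    <Author><LastName>Alpha</LastName><Initials>A</Initials>
      <AffiliationInfo><Affiliation>Inst One</Affiliation></AffiliationInfo>
      <AffiliationInfo><Affiliation>Inst Two</Affiliation></AffiliationInfo>
    </Author>
    <Author><LastName>Beta</LastName><Initials>B</Initials></Author>
    <Author><LastName>Gamma</LastName><Initials>C</Initials>
      <AffiliationInfo><Affiliation>Inst Three</Affiliation></AffiliationInfo>
    </Author>
  </AuthorList></Article></MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation><PMID>2</PMID><Article><AuthorList>
    <Author><LastName>Delta</LastName><Initials>D</Initials></Author>
  </AuthorList></Article></MedlineCitation></PubmedArticle>
</PubmedArticleSet>'

doc <- xmlParse(xml_txt, asText = TRUE)
records <- getNodeSet(doc, "//PubmedArticle")

# Values matched by an XPath within one record, joined by "; ";
# NA when nothing matches (xpathSApply returns an empty list in that case)
grab <- function(rec, path) {
  vals <- xpathSApply(rec, path, xmlValue)
  if (length(vals) == 0) NA_character_ else paste(vals, collapse = "; ")
}

out <- data.frame(
  pmid       = sapply(records, grab, ".//MedlineCitation/PMID"),
  # number of Author nodes in each record
  nAuthors   = sapply(records, function(r) length(getNodeSet(r, ".//AuthorList/Author"))),
  authFirst  = sapply(records, grab, ".//AuthorList/Author[1]/LastName"),
  affilFirst = sapply(records, grab, ".//AuthorList/Author[1]/AffiliationInfo/Affiliation"),
  authLast   = sapply(records, grab, ".//AuthorList/Author[last()]/LastName"),
  affilLast  = sapply(records, grab, ".//AuthorList/Author[last()]/AffiliationInfo/Affiliation"),
  stringsAsFactors = FALSE
)

print(out)
```

Because each affiliation is looked up under a specific Author node, an author without an affiliation yields NA rather than shifting the others along, and an author with several affiliations gets them joined by "; ". The nAuthors column counts Author nodes directly, so no separator counting is needed.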

Upvotes: 0
