Reputation: 11686
I'm trying to extract PMID values, which are PubMed article identifiers. A typical one looks like: <PMID Version=\"1\">30556505</PMID>
I extract that with:
strapplyc(startingString, "<PMID Version=\"1\">(.*?)</PMID>", simplify = c)
I use strapplyc because there could be several of those PMID values in the XML string. However, I do not want some of them, specifically those wrapped in a comments/corrections tag (example):
<CommentsCorrectionsList> <CommentsCorrections RefType=\"CommentIn\"> <RefSource>Gastroenterology. 2019 Feb;156(3):545-546</RefSource> <PMID Version=\"1\">30641052</PMID> </CommentsCorrections> </CommentsCorrectionsList>
How would the regular expression need to be changed to ignore the PMID values inside the CommentsCorrectionsList tag?
The package used is gsubfn, which provides strapplyc.
Upvotes: 0
Views: 84
Reputation: 269744
If we have a well-formed XML document then we would normally use the XML or xml2 package to parse it. The question only shows snippets, and the actual format would be important to know, but as an example let us say that the input has the format shown in the Note at the end. That is, each PMID tag that we want is directly under the root, and the ones we do not want are more than one level down. Then:
library(magrittr)
library(xml2)
Lines %>%
read_xml %>%
xml_find_all("./PMID") %>%
xml_text
## [1] "30556505"
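If the wanted PMID tags are not guaranteed to sit directly under the root, a possible variation is to select every PMID that has no CommentsCorrectionsList ancestor. The following is only a sketch: the `<root>` and `<MedlineCitation>` wrappers and the variable names `xml_string`, `doc` and `pmids` are assumptions made for illustration, not taken from the question's actual data.

```r
library(xml2)

# Hypothetical input: the wanted PMID is nested one level down (assumed wrapper),
# and the unwanted one sits inside CommentsCorrectionsList as in the question.
xml_string <- '<root>
<MedlineCitation>
<PMID Version="1">30556505</PMID>
</MedlineCitation>
<CommentsCorrectionsList>
<CommentsCorrections RefType="CommentIn">
<RefSource>Gastroenterology. 2019 Feb;156(3):545-546</RefSource>
<PMID Version="1">30641052</PMID>
</CommentsCorrections>
</CommentsCorrectionsList>
</root>'

doc <- read_xml(xml_string)

# Keep PMID nodes at any depth, except those below a CommentsCorrectionsList element.
pmids <- xml_text(xml_find_all(doc, "//PMID[not(ancestor::CommentsCorrectionsList)]"))
pmids
```

Filtering on the ancestor axis makes the query independent of how deep the wanted PMID is, at the cost of scanning the whole document.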
Alternatively, there are a number of R packages for accessing PubMed, including easyPubMed, pubmed.mineR, rentrez and RISmed on CRAN, annotate on Bioconductor and Rcupcake on GitHub.
Note: the assumed input is:
Lines <-
"<root>
<PMID Version=\"1\">30556505</PMID>
<CommentsCorrectionsList>
<CommentsCorrections RefType=\"CommentIn\">
<RefSource>Gastroenterology. 2019 Feb;156(3):545-546</RefSource>
<PMID Version=\"1\">30641052</PMID>
</CommentsCorrections>
</CommentsCorrectionsList>
</root>"
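If a regular-expression approach is still preferred, as the question asks, one possible sketch is to delete every CommentsCorrectionsList block first and then extract the remaining PMID values. This uses base R only. The names `xml_string`, `cleaned` and `pmids` are hypothetical, and the sketch assumes the opening tag is written exactly as `<CommentsCorrectionsList>` with no attributes. Regular expressions applied to XML are fragile, so this should be treated as a workaround rather than a robust parser.

```r
# Hypothetical input mirroring the snippets in the question.
xml_string <- '<root>
<PMID Version="1">30556505</PMID>
<CommentsCorrectionsList>
<CommentsCorrections RefType="CommentIn">
<RefSource>Gastroenterology. 2019 Feb;156(3):545-546</RefSource>
<PMID Version="1">30641052</PMID>
</CommentsCorrections>
</CommentsCorrectionsList>
</root>'

# Step 1: remove each CommentsCorrectionsList block.
# (?s) lets "." match newlines; ".*?" is non-greedy so separate blocks are not merged.
cleaned <- gsub("(?s)<CommentsCorrectionsList>.*?</CommentsCorrectionsList>",
                "", xml_string, perl = TRUE)

# Step 2: pull out the text between the remaining PMID tags using lookarounds.
pattern <- '(?<=<PMID Version="1">)[^<]*(?=</PMID>)'
pmids <- regmatches(cleaned, gregexpr(pattern, cleaned, perl = TRUE))[[1]]
pmids
```

After step 1, the original call from the question, `strapplyc(cleaned, "<PMID Version=\"1\">(.*?)</PMID>", simplify = c)`, could be applied to `cleaned` instead of step 2.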
Upvotes: 1