user1357015

Reputation: 11686

R regular expression of XML tag that's NOT in another tag

I'm trying to extract PMID values, which are PubMed journal identifiers. A typical one looks like: <PMID Version=\"1\">30556505</PMID>

I extract that with:

strapplyc(startingString, "<PMID Version=\"1\">(.*?)</PMID>", simplify = c)

I use strapplyc because there could be several of those PMID values in the XML string. However, I do not want some of them, specifically those wrapped in a comments/corrections tag, for example:

<CommentsCorrectionsList> <CommentsCorrections RefType=\"CommentIn\"> <RefSource>Gastroenterology. 2019 Feb;156(3):545-546</RefSource> <PMID Version=\"1\">30641052</PMID> </CommentsCorrections> </CommentsCorrectionsList>

How would the regular expression need to be changed to ignore the PMID values inside the CommentsCorrectionsList tag?

The only package used is gsubfn, for strapplyc.

Upvotes: 0

Views: 84

Answers (1)

G. Grothendieck

Reputation: 269744

If we have a well-formed XML document, we would normally parse it with the XML or xml2 package. The question shows only snippets, and the actual format would be important to know, but as an example let us say that the input has the format shown in the Note at the end. That is, each PMID tag that we want is directly under the root, and the ones we do not want are more than one level down. Then

library(magrittr)
library(xml2)

Lines %>%
  read_xml %>%
  xml_find_all("./PMID") %>%
  xml_text
## [1] "30556505"
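
If the wanted PMID tags are not direct children of the root, the "./PMID" path above would miss them. A minimal sketch of a more general filter, assuming the unwanted tags always sit somewhere inside a CommentsCorrectionsList element, is to exclude them by ancestor in the XPath expression. The nesting under PubmedArticle and MedlineCitation and the names doc and pmids are assumptions made for illustration; they are not taken from the question.

```r
library(xml2)

# Hypothetical input: the wanted PMID is nested two levels down,
# so "./PMID" relative to the root would not find it.
doc <- read_xml('<PubmedArticle>
  <MedlineCitation>
    <PMID Version="1">30556505</PMID>
    <CommentsCorrectionsList>
      <CommentsCorrections RefType="CommentIn">
        <PMID Version="1">30641052</PMID>
      </CommentsCorrections>
    </CommentsCorrectionsList>
  </MedlineCitation>
</PubmedArticle>')

# Keep every PMID that has no CommentsCorrectionsList ancestor
pmids <- xml_text(
  xml_find_all(doc, "//PMID[not(ancestor::CommentsCorrectionsList)]")
)
pmids
## [1] "30556505"
```

This does not depend on how deep the wanted tags are, only on the unwanted ones being enclosed in CommentsCorrectionsList.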

Alternatively, there are a number of R packages for accessing PubMed, including easyPubMed, pubmed.mineR, rentrez and RISmed on CRAN, annotate on Bioconductor and Rcupcake on GitHub.

Note

Assumed input:

Lines <- 
"<root>
<PMID Version=\"1\">30556505</PMID>
<CommentsCorrectionsList>
<CommentsCorrections RefType=\"CommentIn\">
<RefSource>Gastroenterology. 2019 Feb;156(3):545-546</RefSource>
<PMID Version=\"1\">30641052</PMID>
</CommentsCorrections>
</CommentsCorrectionsList>
</root>"
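
If the string is only a fragment and cannot be parsed as XML, a regex-only fallback may still work. The sketch below is not the xml2 approach above; it assumes CommentsCorrectionsList blocks are never nested inside one another, and it first deletes those blocks and then extracts the remaining PMID values with base R. The input is made up to mirror the question's snippets, and the names cleaned and kept are hypothetical.

```r
# Hypothetical fragment mirroring the snippets in the question
startingString <- paste0(
  "<PMID Version=\"1\">30556505</PMID>",
  "<CommentsCorrectionsList><CommentsCorrections RefType=\"CommentIn\">",
  "<PMID Version=\"1\">30641052</PMID>",
  "</CommentsCorrections></CommentsCorrectionsList>"
)

# Step 1: drop each CommentsCorrectionsList block.
# (?s) lets "." match newlines; ".*?" stops at the first closing tag.
cleaned <- gsub("(?s)<CommentsCorrectionsList>.*?</CommentsCorrectionsList>",
                "", startingString, perl = TRUE)

# Step 2: extract the text between the remaining PMID tags
# using fixed-length lookbehind and lookahead.
kept <- unlist(regmatches(
  cleaned,
  gregexpr("(?<=<PMID Version=\"1\">).*?(?=</PMID>)", cleaned, perl = TRUE)
))
kept
## [1] "30556505"
```

The cleaned string could equally be passed to the strapplyc call from the question. Regular expressions remain fragile on XML, so parsing is preferable whenever the document is well formed.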

Upvotes: 1
