Eric

Reputation: 53

Parsing XML content using R to extract the title information

I have the following XML data:

<?xml version="1.0" encoding="UTF-8"?>
<ClinVarResult-Set>
  <ClinVarSet ID="95075">
    <RecordStatus>not current</RecordStatus>
    <Title>MPV17, 26-BP DEL, NT116 AND Navajo neurohepatopathy</Title>
    <ReferenceClinVarAssertion DateCreated="2012-08-13" DateLastUpdated="2013-04-03" ID="75049">
      <ClinVarAccession Acc="RCV000017546" Version="1" Type="RCV" DateUpdated="2013-04-08"/>
      <RecordStatus>current</RecordStatus>
      <ClinicalSignificance DateLastEvaluated="2011-11-17">
        <ReviewStatus>classified by single submitter</ReviewStatus>
        <Description>pathogenic</Description>
      </ClinicalSignificance>
      <Assertion Type="variation to disease"/>
      <ExternalID DB="NCBI"/>
      <ObservedIn>
        <Sample>
          <Origin>germline</Origin>
          <Species TaxonomyId="9606">human</Species>
          <AffectedStatus>not provided</AffectedStatus>
        </Sample>
        <Method>
          <MethodType>curation</MethodType>
        </Method>
        <ObservedData ID="208542">
          <Attribute Type="Description">See 137960.0003 and Spinazzola et al. (2006).</Attribute>
          <Citation Type="general">
            <ID Source="PubMed">16582910</ID>
          </Citation>
        </ObservedData>
      </ObservedIn>
      <MeasureSet Type="Variant" ID="16163">
        <Measure Type="Deletion" ID="31202">
          <Name>
            <ElementValue Type="Alternate">MPV17, 26-BP DEL, NT116</ElementValue>
            <XRef Type="Allelic variant" ID="137960.0004" DB="OMIM"/>
          </Name>
          <AttributeSet>
            <Attribute Type="nucleotide change">26-BP DEL, NT116</Attribute>
            <XRef Type="Allelic variant" ID="137960.0004" DB="OMIM"/>
          </AttributeSet>
          <MeasureRelationship Type="variant in gene">
            <Name>
              <ElementValue Type="Preferred">MpV17 mitochondrial inner membrane protein</ElementValue>
            </Name>
            <Symbol>
              <ElementValue Type="Preferred">MPV17</ElementValue>
            </Symbol>
            <XRef ID="4358" DB="Gene"/>
            <XRef ID="137960" DB="OMIM" Type="MIM"/>
          </MeasureRelationship>
          <XRef Type="Allelic variant" ID="137960.0004" DB="OMIM"/>
        </Measure>
      </MeasureSet>
      <TraitSet Type="Disease" ID="5245">
        <Trait ID="3439" Type="Disease">
          <Name>
            <ElementValue Type="Preferred">Navajo neurohepatopathy</ElementValue>
            <XRef ID="3972" DB="Office of Rare Diseases"/>
          </Name>
          <Name>
            <ElementValue Type="Alternate">Navajo neuropathy</ElementValue>
          </Name>
          <Name>
            <ElementValue Type="Alternate">MITOCHONDRIAL DNA DEPLETION SYNDROME 6 (HEPATOCEREBRAL TYPE)</ElementValue>
            <XRef Type="MIM" ID="256810" DB="OMIM"/>
            <XRef Type="Allelic variant" ID="137960.0002" DB="OMIM"/>
            <XRef Type="Allelic variant" ID="137960.0003" DB="OMIM"/>
            <XRef Type="Allelic variant" ID="137960.0005" DB="OMIM"/>
            <XRef Type="Allelic variant" ID="137960.0004" DB="OMIM"/>
            <XRef Type="Allelic variant" ID="137960.0001" DB="OMIM"/>
            <XRef Type="Allelic variant" ID="137960.0006" DB="OMIM"/>
            <XRef Type="Allelic variant" ID="137960.0007" DB="OMIM"/>
          </Name>
          <Name>
            <ElementValue Type="Alternate">MPV17- Related Hepatocerebral Mitochondrial DNA Depletion Syndrome</ElementValue>
            <XRef ID="NBK92947" DB="GeneReviews"/>
          </Name>
          <Symbol>
            <ElementValue Type="Preferred">MTDPS6</ElementValue>
            <XRef Type="MIM" ID="256810" DB="OMIM"/>
          </Symbol>
          <Symbol>
            <ElementValue Type="Alternate">NN</ElementValue>
            <XRef Type="MIM" ID="256810" DB="OMIM"/>
            <XRef ID="3972" DB="Office of Rare Diseases"/>
          </Symbol>
          <Symbol>
            <ElementValue Type="Alternate">NNH</ElementValue>
            <XRef Type="MIM" ID="256810" DB="OMIM"/>
          </Symbol>
          <AttributeSet>
            <Attribute Type="age of onset">Childhood</Attribute>
          </AttributeSet>
          <Citation Type="review" Abbrev="GeneReviews">
            <ID Source="PubMed">22593919</ID>
          </Citation>
          <XRef ID="255229" DB="Orphanet"/>
          <XRef ID="C1850406" DB="MedGen"/>
          <XRef ID="NBK92947" DB="GeneReviews"/>
          <XRef Type="MIM" ID="256810" DB="OMIM"/>
        </Trait>
      </TraitSet>
    </ReferenceClinVarAssertion>
    <ClinVarAssertion ID="37818">
      <ClinVarSubmissionID localKey="137960.0004_MITOCHONDRIAL DNA DEPLETION SYNDROME 6 (HEPATOCEREBRAL TYPE)" title="MPV17, 26-BP DEL, NT116 _MITOCHONDRIAL DNA DEPLETION SYNDROME 6 (HEPATOCEREBRAL TYPE)" submitterDate="2011-11-17" submitter="OMIM"/>
      <ClinVarAccession Acc="SCV000037818" OrgID="3" Version="1" Type="SCV" DateUpdated="2013-04-08"/>
      <RecordStatus>current</RecordStatus>
      <ClinicalSignificance DateLastEvaluated="2011-11-17">
        <Description>pathogenic</Description>
      </ClinicalSignificance>
      <Assertion Type="variation to disease"/>
      <ObservedIn>
        <Sample>
          <Origin>germline</Origin>
          <Species>human</Species>
          <AffectedStatus>not provided</AffectedStatus>
        </Sample>
        <Method>
          <MethodType>curation</MethodType>
        </Method>
        <ObservedData>
          <Attribute Type="Description">See 137960.0003 and Spinazzola et al. (2006).</Attribute>
          <Citation>
            <ID Source="PubMed">16582910</ID>
          </Citation>
        </ObservedData>
      </ObservedIn>
      <MeasureSet Type="Variant">
        <Measure Type="Variation">
          <Name>
            <ElementValue Type="Preferred">MPV17, 26-BP DEL, NT116 </ElementValue>
          </Name>
          <AttributeSet>
            <Attribute Type="NonHGVS">26-BP DEL, NT116</Attribute>
          </AttributeSet>
          <MeasureRelationship Type="variant in gene">
            <Symbol>
              <ElementValue Type="Preferred">MPV17</ElementValue>
            </Symbol>
          </MeasureRelationship>
          <XRef DB="OMIM" Type="Allelic variant" ID="137960.0004"/>
        </Measure>
      </MeasureSet>
      <TraitSet Type="Disease">
        <Trait Type="Disease">
          <Name>
            <ElementValue Type="Preferred">MITOCHONDRIAL DNA DEPLETION SYNDROME 6 (HEPATOCEREBRAL TYPE)</ElementValue>
          </Name>
        </Trait>
      </TraitSet>
    </ClinVarAssertion>
  </ClinVarSet>
</ClinVarResult-Set>

I want to extract just the accession number (the Acc attribute, "RCV000017546" in the data above) from the ClinVarAccession element. I have tried:

getNodeSet(rcv_data, "//ClinVarAccession")

or

xmlRoot(rcv_data)[["ClinVarSet"]][["ReferenceClinVarAssertion"]][["ClinVarAccession"]]

but both return the whole ClinVarAccession element when I only want the accession number. How do I get just the number? I have tried several approaches and looked at many examples, but I can't find one that answers this.

rcv_data is retrieved by:

library(XML)
library(httr)
UA <- "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
id <- 95075
rcv_search <- paste0("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=clinvar&rettype=clinvarset&id=", id)
rcv_doc <- GET(rcv_search, user_agent(UA))
rcv_data <- xmlParse(content(rcv_doc, "text"))

Upvotes: 0

Views: 131

Answers (2)

Joshua Mire

Reputation: 736

Below are two versions of the code (one using dplyr, one without) that return the accession numbers beginning with "R". The document you provided contains two ClinVarAccession elements, one whose accession begins with "R" and one with "S", and you specified that you want the one beginning with "R":

## return accession numbers that begin with "R"
## XML provides getNodeSet() and xmlGetAttr(); dplyr provides the %>% pipe
library(XML)
library(dplyr)
## get the ClinVarAccession nodes (rcv_data is the parsed document from the question)
getNodeSet(rcv_data, "//ClinVarAccession") %>%
## get the Acc attribute from each node
sapply(xmlGetAttr, "Acc") %>%
## keep the accessions that start with "R"
.[grep("^R", .)]

Or, without dplyr, you could do something like this:

## get the ClinVarAccession nodes (rcv_data is the parsed document from the question)
nodes <- getNodeSet(rcv_data, "//ClinVarAccession")
## get the Acc attribute from each node
accs <- sapply(nodes, xmlGetAttr, "Acc")
## keep the accessions that start with "R"
accs[grep("^R", accs)]
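
As an alternative to filtering on the first letter, the selection can probably be pushed into the XPath expression itself by matching on the Type attribute. This is only a sketch: it uses a trimmed, hand-written copy of the XML (the xml_text string is an assumption standing in for the real efetch response) and assumes the reference accession always carries Type="RCV", as it does in the sample.

```r
library(XML)

# Trimmed stand-in for the efetch response (assumption: same element layout
# as the sample in the question, with unrelated elements removed)
xml_text <- '<ClinVarResult-Set>
  <ClinVarSet ID="95075">
    <ReferenceClinVarAssertion ID="75049">
      <ClinVarAccession Acc="RCV000017546" Version="1" Type="RCV"/>
    </ReferenceClinVarAssertion>
    <ClinVarAssertion ID="37818">
      <ClinVarAccession Acc="SCV000037818" Version="1" Type="SCV"/>
    </ClinVarAssertion>
  </ClinVarSet>
</ClinVarResult-Set>'
rcv_data <- xmlParse(xml_text, asText = TRUE)

# Select only the ClinVarAccession nodes whose Type attribute is "RCV",
# then read the Acc attribute of each matching node
rcv_acc <- xpathSApply(rcv_data, "//ClinVarAccession[@Type='RCV']",
                       xmlGetAttr, "Acc")
print(rcv_acc)
```

With the real response, rcv_data would be the document parsed in the question rather than this inline string.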

I hope this helps!

Upvotes: 1

Wimpel

Reputation: 27772

Here is my go at things... does that help?

library( xml2 )
id=95075
rcv_search= paste0("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=clinvar&rettype=clinvarset&id=",id)
#read data
rcv_data <- xml2::read_xml(rcv_search)

xml2::xml_find_all( rcv_data, ".//ClinVarAccession" )

# {xml_nodeset (2)}
# [1] <ClinVarAccession Acc="RCV000017546" Version="1" Type="RCV" DateUpdated="2013-04-08"/>
# [2] <ClinVarAccession Acc="SCV000037818" OrgID="3" Version="1" Type="SCV" DateUpdated="2013-04-08"/>

Please add your expected output so that the nodes can be processed further.
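
If the goal is only the value of the Acc attribute, one possible next step with xml2 is xml_attr(). The sketch below runs on a trimmed, hand-written copy of the XML (the xml_text string is an assumption, not the live efetch response) and assumes the reference accession is the node with Type="RCV".

```r
library(xml2)

# Trimmed stand-in for the efetch response (assumption: same element layout
# as the sample in the question, with unrelated elements removed)
xml_text <- '<ClinVarResult-Set>
  <ClinVarSet ID="95075">
    <ReferenceClinVarAssertion ID="75049">
      <ClinVarAccession Acc="RCV000017546" Version="1" Type="RCV"/>
    </ReferenceClinVarAssertion>
    <ClinVarAssertion ID="37818">
      <ClinVarAccession Acc="SCV000037818" Version="1" Type="SCV"/>
    </ClinVarAssertion>
  </ClinVarSet>
</ClinVarResult-Set>'
rcv_data <- read_xml(xml_text)

# Find every ClinVarAccession node and pull out its Acc attribute
nodes <- xml_find_all(rcv_data, ".//ClinVarAccession")
accs <- xml_attr(nodes, "Acc")

# Keep only the accession of the node whose Type attribute is "RCV"
rcv_acc <- accs[xml_attr(nodes, "Type") == "RCV"]
print(rcv_acc)
```

Here accs holds both accessions in document order, and rcv_acc keeps only the reference one.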

Upvotes: 0
