jkm
jkm

Reputation: 71

How to get data out of nested xml structure?

I'm trying to work API which gives me data as nested XML and I want to save it as a data frame. My problem is that I don't know how to get values out of this nested XML. Here is a example:

# Sample data
library(xml2)
url <- "https://clinicaltrials.gov/api/query/full_studies?expr=neuro&min_rnk=1&max_rnk=20&fmt=xml"
download.file(url, destfile = "xml_data.xml")
fil <- "xml_data.xml"
dat <- xml2::read_xml(fil)

This gives a nested xml file, but I don't understand how to work with this structure.

<FullStudiesResponse>
  ....
  <FullStudyList>
    <FullStudy Rank="1">
      <Struct Name="Study">
        <Struct Name="ProtocolSection">
          <Struct Name="IdentificationModule">
            <Field Name="NCTId">NCT01843582</Field>

I can get to FullStudyList with command like:

xml_find_all(x = dat, xpath = "//FullStudyList/FullStudy")

But for example, if I want to get all NCTId or Rank values, how I can refer to it? So far I have tried

xml_find_all(x = dat, xpath = "//FullStudyList/FullStudy/NCTId")
xml_find_all(x = dat, xpath = "//FullStudyList/FullStudy/@NCTId")
xml_find_all(x = dat, xpath = "//FullStudyList/FullStudy//NCTId")

Which obviously won't work. Or is there better way to work with nested xml's to get data in a data frame?

Upvotes: 1

Views: 304

Answers (1)

ekoam
ekoam

Reputation: 8844

The short answer is: don't use XML. The following documentation from that website says that you can specify the fmt you want. It doesn't have to be XML. JSON is much easier to handle in R.

documentaion

Try this

library(httr)
library(jsonlite)
library(tibble)

res <- fromJSON(content(GET("https://clinicaltrials.gov/api/query/full_studies?expr=neuro&min_rnk=1&max_rnk=20&fmt=json")))

The result is a nested list, but I guess that you are interested in the data stored in FullStudies

df <- as_tibble(res$FullStudiesResponse$FullStudies)

which gives us

# A tibble: 20 x 2
    Rank Study$ProtocolS~ $$$OrgStudyIdIn~ $$$$OrgStudyIdT~ $$$$OrgStudyIdL~ $$$Organization~ $$$$OrgClass $$$BriefTitle $$$OfficialTitle $$$Acronym $$StatusModule$~
   <int> <chr>            <chr>            <chr>            <chr>            <chr>            <chr>        <chr>         <chr>            <chr>      <chr>           
 1     1 NCT02642055      NEURO+001        NA               NA               Neuro+           INDUSTRY     Efficacy of ~ Efficacy of NEU~ NA         May 2016        
 2     2 NCT01801813      RC12_0416        NA               NA               Nantes Universi~ OTHER        Risk Factors~ Observational S~ Craniosco~ March 2016      
 3     3 NCT03813290      DSRB A/2018/006~ NA               NA               National Health~ OTHER_GOV    A Neuro-Tech~ A Neuro-Technol~ NA         February 2020   
 4     4 NCT03773926      2018-A00604-51   NA               NA               Zeta Technologi~ INDUSTRY     Neuro-feedba~ Neuro-feedback ~ TNTA       December 2018   
 5     5 NCT04189172      AAG-O-H-1630     NA               NA               Aesculap AG      INDUSTRY     MiDura-Study~ Multicenter, In~ MiDura     May 2020        
 6     6 NCT03756337      PIC-20           NA               NA               Oticon Medical   INDUSTRY     Neuro 1 vs. ~ Comparison of A~ NA         November 2018   
 7     7 NCT03484143      P17.03           NA               NA               Vielight Inc.    INDUSTRY     Neuro RX Gam~ Vielight Neuro ~ NA         June 2020       
 8     8 NCT02138110      InVivo-100-101   NA               NA               InVivo Therapeu~ INDUSTRY     The INSPIRE ~ The INSPIRE Stu~ NA         December 2019   
 9     9 NCT03935724      A2017SCI03       NA               NA               Neuroplast       INDUSTRY     Clinical Stu~ A Multi-center,~ SCI2       September 2020  
10    10 NCT03798002      RiphahI Maryam ~ NA               NA               Riphah Internat~ OTHER        Neuro-muscul~ Effects of Neur~ NA         August 2019     
11    11 NCT03655262      R61MH113772      U.S. NIH Grant/~ https://project~ University of C~ OTHER        Treating Pho~ Treating Phobia~ NA         April 2019      
12    12 NCT04418609      Neuro-COVID-19   NA               NA               University of Z~ OTHER        Neuro-COVID-~ Neuro-COVID-19:~ Neuro-COV~ June 2020       
13    13 NCT01174329      1234             NA               NA               Universidad Aut~ OTHER        Treatment of~ Difference in S~ SALELECTR~ July 2010       
14    14 NCT04205019      A2019SCI04       NA               NA               Neuroplast       INDUSTRY     Safety Stem ~ A 3 Months Open~ SSCiSCI    September 2020  
15    15 NCT02941627      PIC_07           NA               NA               Oticon Medical   INDUSTRY     The Neuro Zt~ The Neuro Zti C~ NA         February 2017   
16    16 NCT03328195      P17.02           NA               NA               Vielight Inc.    INDUSTRY     Vielight Neu~ A Pilot Study E~ NA         September 2020  
17    17 NCT02401841      Policlinico 12   NA               NA               Policlinico Hos~ OTHER        Resolution o~ Resolution of N~ NA         October 2015    
18    18 NCT03882567      03/2015          NA               NA               Universidad Rey~ OTHER        Effectivenes~ Effectiveness o~ SCENAR     October 2019    
19    19 NCT04583163      2019-0945        NA               NA               Hackensack Meri~ OTHER        Variability ~ Inter- and Intr~ NA         October 2020    
20    20 NCT01845155      CMTR-TC-02       NA               NA               German Center f~ OTHER        Neuro-Music-~ Neuro-Music-The~ NA         February 2014   
# ... with 103 more variables: $$$OverallStatus <chr>, $$$ExpandedAccessInfo$HasExpandedAccess <chr>, $$$StartDateStruct$StartDate <chr>, $$$$StartDateType <chr>,
#   $$$PrimaryCompletionDateStruct$PrimaryCompletionDate <chr>, $$$$PrimaryCompletionDateType <chr>, $$$CompletionDateStruct$CompletionDate <chr>,
#   $$$$CompletionDateType <chr>, $$$StudyFirstSubmitDate <chr>, $$$StudyFirstSubmitQCDate <chr>, $$$StudyFirstPostDateStruct$StudyFirstPostDate <chr>,
#   $$$$StudyFirstPostDateType <chr>, $$$LastUpdateSubmitDate <chr>, $$$LastUpdatePostDateStruct$LastUpdatePostDate <chr>, $$$$LastUpdatePostDateType <chr>,
#   $$$ResultsFirstSubmitDate <chr>, $$$ResultsFirstSubmitQCDate <chr>, $$$ResultsFirstPostDateStruct$ResultsFirstPostDate <chr>, $$$$ResultsFirstPostDateType <chr>,
#   $$$LastKnownStatus <chr>, $$SponsorCollaboratorsModule$ResponsibleParty$ResponsiblePartyType <chr>, $$$$ResponsiblePartyInvestigatorFullName <chr>,
#   $$$$ResponsiblePartyInvestigatorTitle <chr>, $$$$ResponsiblePartyInvestigatorAffiliation <chr>, $$$$ResponsiblePartyOldNameTitle <chr>,
#   $$$$ResponsiblePartyOldOrganization <chr>, $$$LeadSponsor$LeadSponsorName <chr>, $$$$LeadSponsorClass <chr>, $$$CollaboratorList$Collaborator <list>,
#   $$OversightModule$OversightHasDMC <chr>, $$$IsFDARegulatedDrug <chr>, $$$IsFDARegulatedDevice <chr>, $$$IsUnapprovedDevice <chr>, $$$IsUSExport <chr>,
#   $$DescriptionModule$BriefSummary <chr>, $$$DetailedDescription <chr>, $$ConditionsModule$ConditionList$Condition <list>, $$$KeywordList$Keyword <list>,
#   $$DesignModule$StudyType <chr>, $$$PhaseList$Phase <list>, $$$DesignInfo$DesignAllocation <chr>, $$$$DesignInterventionModel <chr>,
#   $$$$DesignPrimaryPurpose <chr>, $$$$DesignMaskingInfo$DesignMasking <chr>, $$$$$DesignWhoMaskedList$DesignWhoMasked <list>, $$$$$DesignMaskingDescription <chr>,
#   $$$$DesignObservationalModelList$DesignObservationalModel <list>, $$$$DesignTimePerspectiveList$DesignTimePerspective <list>,
#   $$$$DesignInterventionModelDescription <chr>, $$$EnrollmentInfo$EnrollmentCount <chr>, $$$$EnrollmentType <chr>, $$$PatientRegistry <chr>,
#   $$$TargetDuration <chr>, $$ArmsInterventionsModule$ArmGroupList$ArmGroup <list>, $$$InterventionList$Intervention <list>,
#   $$OutcomesModule$PrimaryOutcomeList$PrimaryOutcome <list>, $$$SecondaryOutcomeList$SecondaryOutcome <list>, $$$OtherOutcomeList$OtherOutcome <list>,
#   $$EligibilityModule$EligibilityCriteria <chr>, $$$HealthyVolunteers <chr>, $$$Gender <chr>, $$$MinimumAge <chr>, $$$MaximumAge <chr>, $$$StdAgeList$StdAge <list>,
#   $$$StudyPopulation <chr>, $$$SamplingMethod <chr>, $$ContactsLocationsModule$OverallOfficialList$OverallOfficial <list>, $$$LocationList$Location <list>,
#   $$$CentralContactList$CentralContact <list>, $$IPDSharingStatementModule$IPDSharing <chr>, $$ReferencesModule$ReferenceList$Reference <list>,
#   $$$SeeAlsoLinkList$SeeAlsoLink <list>, $DerivedSection$MiscInfoModule$VersionHolder <chr>, $$$RemovedCountryList$RemovedCountry <list>,
#   $$ConditionBrowseModule$ConditionMeshList$ConditionMesh <list>, $$$ConditionAncestorList$ConditionAncestor <list>,
#   $$$ConditionBrowseLeafList$ConditionBrowseLeaf <list>, $$$ConditionBrowseBranchList$ConditionBrowseBranch <list>,
#   $$InterventionBrowseModule$InterventionBrowseLeafList$InterventionBrowseLeaf <list>, $$$InterventionBrowseBranchList$InterventionBrowseBranch <list>,
#   $ResultsSection$ParticipantFlowModule$FlowGroupList$FlowGroup <list>, $$$FlowPeriodList$FlowPeriod <list>, $$$FlowPreAssignmentDetails <chr>,
#   $$$FlowRecruitmentDetails <chr>, $$BaselineCharacteristicsModule$BaselinePopulationDescription <chr>, $$$BaselineGroupList$BaselineGroup <list>,
#   $$$BaselineDenomList$BaselineDenom <list>, $$$BaselineMeasureList$BaselineMeasure <list>, $$OutcomeMeasuresModule$OutcomeMeasureList$OutcomeMeasure <list>,
#   $$AdverseEventsModule$EventsFrequencyThreshold <chr>, $$$EventsTimeFrame <chr>, $$$EventGroupList$EventGroup <list>, $$$SeriousEventList$SeriousEvent <list>,
#   $$$OtherEventList$OtherEvent <list>, $$MoreInfoModule$CertainAgreement$AgreementPISponsorEmployee <chr>, $$$$AgreementRestrictiveAgreement <chr>,
#   $$$PointOfContact$PointOfContactTitle <chr>, $$$$PointOfContactOrganization <chr>, $$$$PointOfContactEMail <chr>, $$$$PointOfContactPhone <chr>, ...

Upvotes: 3

Related Questions