Reputation: 71
I'm trying to work API which gives me data as nested XML and I want to save it as a data frame. My problem is that I don't know how to get values out of this nested XML. Here is a example:
# Sample data
library(xml2)
url <- "https://clinicaltrials.gov/api/query/full_studies?expr=neuro&min_rnk=1&max_rnk=20&fmt=xml"
download.file(url, destfile = "xml_data.xml")
fil <- "xml_data.xml"
dat <- xml2::read_xml(fil)
This gives a nested xml file, but I don't understand how to work with this structure.
<FullStudiesResponse>
....
<FullStudyList>
<FullStudy Rank="1">
<Struct Name="Study">
<Struct Name="ProtocolSection">
<Struct Name="IdentificationModule">
<Field Name="NCTId">NCT01843582</Field>
I can get to FullStudyList with command like:
xml_find_all(x = dat, xpath = "//FullStudyList/FullStudy")
But for example, if I want to get all NCTId
or Rank
values, how I can refer to it? So far I have tried
xml_find_all(x = dat, xpath = "//FullStudyList/FullStudy/NCTId")
xml_find_all(x = dat, xpath = "//FullStudyList/FullStudy/@NCTId")
xml_find_all(x = dat, xpath = "//FullStudyList/FullStudy//NCTId")
Which obviously won't work. Or is there better way to work with nested xml's to get data in a data frame?
Upvotes: 1
Views: 304
Reputation: 8844
The short answer is: don't use XML. The following documentation from that website says that you can specify the fmt
you want. It doesn't have to be XML. JSON is much easier to handle in R.
Try this
library(httr)
library(jsonlite)
library(tibble)
res <- fromJSON(content(GET("https://clinicaltrials.gov/api/query/full_studies?expr=neuro&min_rnk=1&max_rnk=20&fmt=json")))
The result is a nested list, but I guess that you are interested in the data stored in FullStudies
df <- as_tibble(res$FullStudiesResponse$FullStudies)
which gives us
# A tibble: 20 x 2
Rank Study$ProtocolS~ $$$OrgStudyIdIn~ $$$$OrgStudyIdT~ $$$$OrgStudyIdL~ $$$Organization~ $$$$OrgClass $$$BriefTitle $$$OfficialTitle $$$Acronym $$StatusModule$~
<int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 NCT02642055 NEURO+001 NA NA Neuro+ INDUSTRY Efficacy of ~ Efficacy of NEU~ NA May 2016
2 2 NCT01801813 RC12_0416 NA NA Nantes Universi~ OTHER Risk Factors~ Observational S~ Craniosco~ March 2016
3 3 NCT03813290 DSRB A/2018/006~ NA NA National Health~ OTHER_GOV A Neuro-Tech~ A Neuro-Technol~ NA February 2020
4 4 NCT03773926 2018-A00604-51 NA NA Zeta Technologi~ INDUSTRY Neuro-feedba~ Neuro-feedback ~ TNTA December 2018
5 5 NCT04189172 AAG-O-H-1630 NA NA Aesculap AG INDUSTRY MiDura-Study~ Multicenter, In~ MiDura May 2020
6 6 NCT03756337 PIC-20 NA NA Oticon Medical INDUSTRY Neuro 1 vs. ~ Comparison of A~ NA November 2018
7 7 NCT03484143 P17.03 NA NA Vielight Inc. INDUSTRY Neuro RX Gam~ Vielight Neuro ~ NA June 2020
8 8 NCT02138110 InVivo-100-101 NA NA InVivo Therapeu~ INDUSTRY The INSPIRE ~ The INSPIRE Stu~ NA December 2019
9 9 NCT03935724 A2017SCI03 NA NA Neuroplast INDUSTRY Clinical Stu~ A Multi-center,~ SCI2 September 2020
10 10 NCT03798002 RiphahI Maryam ~ NA NA Riphah Internat~ OTHER Neuro-muscul~ Effects of Neur~ NA August 2019
11 11 NCT03655262 R61MH113772 U.S. NIH Grant/~ https://project~ University of C~ OTHER Treating Pho~ Treating Phobia~ NA April 2019
12 12 NCT04418609 Neuro-COVID-19 NA NA University of Z~ OTHER Neuro-COVID-~ Neuro-COVID-19:~ Neuro-COV~ June 2020
13 13 NCT01174329 1234 NA NA Universidad Aut~ OTHER Treatment of~ Difference in S~ SALELECTR~ July 2010
14 14 NCT04205019 A2019SCI04 NA NA Neuroplast INDUSTRY Safety Stem ~ A 3 Months Open~ SSCiSCI September 2020
15 15 NCT02941627 PIC_07 NA NA Oticon Medical INDUSTRY The Neuro Zt~ The Neuro Zti C~ NA February 2017
16 16 NCT03328195 P17.02 NA NA Vielight Inc. INDUSTRY Vielight Neu~ A Pilot Study E~ NA September 2020
17 17 NCT02401841 Policlinico 12 NA NA Policlinico Hos~ OTHER Resolution o~ Resolution of N~ NA October 2015
18 18 NCT03882567 03/2015 NA NA Universidad Rey~ OTHER Effectivenes~ Effectiveness o~ SCENAR October 2019
19 19 NCT04583163 2019-0945 NA NA Hackensack Meri~ OTHER Variability ~ Inter- and Intr~ NA October 2020
20 20 NCT01845155 CMTR-TC-02 NA NA German Center f~ OTHER Neuro-Music-~ Neuro-Music-The~ NA February 2014
# ... with 103 more variables: $$$OverallStatus <chr>, $$$ExpandedAccessInfo$HasExpandedAccess <chr>, $$$StartDateStruct$StartDate <chr>, $$$$StartDateType <chr>,
# $$$PrimaryCompletionDateStruct$PrimaryCompletionDate <chr>, $$$$PrimaryCompletionDateType <chr>, $$$CompletionDateStruct$CompletionDate <chr>,
# $$$$CompletionDateType <chr>, $$$StudyFirstSubmitDate <chr>, $$$StudyFirstSubmitQCDate <chr>, $$$StudyFirstPostDateStruct$StudyFirstPostDate <chr>,
# $$$$StudyFirstPostDateType <chr>, $$$LastUpdateSubmitDate <chr>, $$$LastUpdatePostDateStruct$LastUpdatePostDate <chr>, $$$$LastUpdatePostDateType <chr>,
# $$$ResultsFirstSubmitDate <chr>, $$$ResultsFirstSubmitQCDate <chr>, $$$ResultsFirstPostDateStruct$ResultsFirstPostDate <chr>, $$$$ResultsFirstPostDateType <chr>,
# $$$LastKnownStatus <chr>, $$SponsorCollaboratorsModule$ResponsibleParty$ResponsiblePartyType <chr>, $$$$ResponsiblePartyInvestigatorFullName <chr>,
# $$$$ResponsiblePartyInvestigatorTitle <chr>, $$$$ResponsiblePartyInvestigatorAffiliation <chr>, $$$$ResponsiblePartyOldNameTitle <chr>,
# $$$$ResponsiblePartyOldOrganization <chr>, $$$LeadSponsor$LeadSponsorName <chr>, $$$$LeadSponsorClass <chr>, $$$CollaboratorList$Collaborator <list>,
# $$OversightModule$OversightHasDMC <chr>, $$$IsFDARegulatedDrug <chr>, $$$IsFDARegulatedDevice <chr>, $$$IsUnapprovedDevice <chr>, $$$IsUSExport <chr>,
# $$DescriptionModule$BriefSummary <chr>, $$$DetailedDescription <chr>, $$ConditionsModule$ConditionList$Condition <list>, $$$KeywordList$Keyword <list>,
# $$DesignModule$StudyType <chr>, $$$PhaseList$Phase <list>, $$$DesignInfo$DesignAllocation <chr>, $$$$DesignInterventionModel <chr>,
# $$$$DesignPrimaryPurpose <chr>, $$$$DesignMaskingInfo$DesignMasking <chr>, $$$$$DesignWhoMaskedList$DesignWhoMasked <list>, $$$$$DesignMaskingDescription <chr>,
# $$$$DesignObservationalModelList$DesignObservationalModel <list>, $$$$DesignTimePerspectiveList$DesignTimePerspective <list>,
# $$$$DesignInterventionModelDescription <chr>, $$$EnrollmentInfo$EnrollmentCount <chr>, $$$$EnrollmentType <chr>, $$$PatientRegistry <chr>,
# $$$TargetDuration <chr>, $$ArmsInterventionsModule$ArmGroupList$ArmGroup <list>, $$$InterventionList$Intervention <list>,
# $$OutcomesModule$PrimaryOutcomeList$PrimaryOutcome <list>, $$$SecondaryOutcomeList$SecondaryOutcome <list>, $$$OtherOutcomeList$OtherOutcome <list>,
# $$EligibilityModule$EligibilityCriteria <chr>, $$$HealthyVolunteers <chr>, $$$Gender <chr>, $$$MinimumAge <chr>, $$$MaximumAge <chr>, $$$StdAgeList$StdAge <list>,
# $$$StudyPopulation <chr>, $$$SamplingMethod <chr>, $$ContactsLocationsModule$OverallOfficialList$OverallOfficial <list>, $$$LocationList$Location <list>,
# $$$CentralContactList$CentralContact <list>, $$IPDSharingStatementModule$IPDSharing <chr>, $$ReferencesModule$ReferenceList$Reference <list>,
# $$$SeeAlsoLinkList$SeeAlsoLink <list>, $DerivedSection$MiscInfoModule$VersionHolder <chr>, $$$RemovedCountryList$RemovedCountry <list>,
# $$ConditionBrowseModule$ConditionMeshList$ConditionMesh <list>, $$$ConditionAncestorList$ConditionAncestor <list>,
# $$$ConditionBrowseLeafList$ConditionBrowseLeaf <list>, $$$ConditionBrowseBranchList$ConditionBrowseBranch <list>,
# $$InterventionBrowseModule$InterventionBrowseLeafList$InterventionBrowseLeaf <list>, $$$InterventionBrowseBranchList$InterventionBrowseBranch <list>,
# $ResultsSection$ParticipantFlowModule$FlowGroupList$FlowGroup <list>, $$$FlowPeriodList$FlowPeriod <list>, $$$FlowPreAssignmentDetails <chr>,
# $$$FlowRecruitmentDetails <chr>, $$BaselineCharacteristicsModule$BaselinePopulationDescription <chr>, $$$BaselineGroupList$BaselineGroup <list>,
# $$$BaselineDenomList$BaselineDenom <list>, $$$BaselineMeasureList$BaselineMeasure <list>, $$OutcomeMeasuresModule$OutcomeMeasureList$OutcomeMeasure <list>,
# $$AdverseEventsModule$EventsFrequencyThreshold <chr>, $$$EventsTimeFrame <chr>, $$$EventGroupList$EventGroup <list>, $$$SeriousEventList$SeriousEvent <list>,
# $$$OtherEventList$OtherEvent <list>, $$MoreInfoModule$CertainAgreement$AgreementPISponsorEmployee <chr>, $$$$AgreementRestrictiveAgreement <chr>,
# $$$PointOfContact$PointOfContactTitle <chr>, $$$$PointOfContactOrganization <chr>, $$$$PointOfContactEMail <chr>, $$$$PointOfContactPhone <chr>, ...
Upvotes: 3