Reputation: 105
I'm using xml2 to pull publication data out of an online XML doc, like this one, with this code:
xF <- read_xml(target, encoding = "UTF-8") ## target = above link
No problems getting items that exist for each publication node.
Titles <- xml_text(xml_find_all(xF, "//publication-base_uk:title", xml_ns(xF)))
Pub.Lang <- xml_text(xml_find_all(xF, "//publication-base_uk:language/core:term/core:localizedString", xml_ns(xF)))
## etc...
However, I'm stumped as to how to get items that don't always have an entry, like the peer review tag.
Peer.Rev <- xml_text(xml_find_all(xF, "//extensions-core:peerReviewed", xml_ns(xF)))
This returns a value for every publication that has a peerReviewed child, but since some publications have no such child, the result is shorter than the other vectors and the count is off. Is there a way to put an NA (or anything, really) in place of the missing text values?
Thanks in advance.
Upvotes: 5
Views: 1228
Reputation: 2832
Using xml2::xml_find_first() should get you what you want.
Let's say we want the blog post categories from this RSS feed: https://eagereyes.org/feed. Some of these posts have one category, some have more than one. Searching for the first category works just fine:
library(magrittr)  # provides the %>% pipe used below
feed <- "https://eagereyes.org/feed"
doc <- httr::GET(feed) %>% xml2::read_xml()
channel <- xml2::xml_find_all(doc, "channel")
site <- xml2::xml_find_all(channel, "item")
categories <- tibble::tibble(
category1 = xml2::xml_text(xml2::xml_find_all(site, "category[1]"))
)
> categories
# A tibble: 10 x 1
   category1
   <chr>
 1 Papers
 2 Blog 2017
 3 Links
 4 Blog 2017
 5 Blog 2017
 6 Talk
 7 ISOTYPE Books
 8 Techniques
 9 Basics
10 Blog 2017
But trying this for more than one does not:
categories <- tibble::tibble(
category1 = xml2::xml_text(xml2::xml_find_all(site, "category[1]")),
category2 = xml2::xml_text(xml2::xml_find_all(site, "category[2]"))
)
Error: Column `category2` must be length 1 or 10, not 3
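The length mismatch comes from xml_find_all() returning one flat node set of the matches only, so items without a second category are skipped. A minimal offline sketch of that behaviour, using a made-up three-item feed (not the real one) in which only the first item has a second category:

```r
library(xml2)

# Hypothetical stand-in for the feed: only the first item has two categories.
doc <- read_xml("<channel>
  <item><category>Papers</category><category>paper</category></item>
  <item><category>Links</category></item>
  <item><category>Talk</category></item>
</channel>")
items <- xml_find_all(doc, "item")

# Number of items versus number of second categories actually found.
n_items   <- length(items)
n_matches <- length(xml_find_all(items, "category[2]"))

# The two counts differ, which is what makes the tibble columns unequal.
n_items
n_matches
```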
xml_find_first to the rescue:
categories <- tibble::tibble(
category1 = xml2::xml_text(xml2::xml_find_first(site, "category[1]")),
category2 = xml2::xml_text(xml2::xml_find_first(site, "category[2]"))
)
> categories
# A tibble: 10 x 2
   category1     category2
   <chr>         <chr>
 1 Papers        paper
 2 Blog 2017     conference
 3 Links         <NA>
 4 Blog 2017     <NA>
 5 Blog 2017     <NA>
 6 Talk          <NA>
 7 ISOTYPE Books isotype
 8 Techniques    <NA>
 9 Basics        <NA>
10 Blog 2017     <NA>
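The same pattern should carry over to the namespaced document in the question: select the publication nodes first, then call xml_find_first() once per node with a relative path. The sketch below is only an illustration under assumptions. The inline XML, the namespace URIs, the pub/ext prefixes, and the pub:publication element name are all made up, since the real document's structure isn't shown here.

```r
library(xml2)

# Made-up stand-in for the question's document; namespace URIs and
# element names are placeholders, not the real schema.
xF <- read_xml('<result xmlns:pub="http://example.org/pub" xmlns:ext="http://example.org/ext">
  <pub:publication><pub:title>A</pub:title><ext:peerReviewed>true</ext:peerReviewed></pub:publication>
  <pub:publication><pub:title>B</pub:title></pub:publication>
  <pub:publication><pub:title>C</pub:title><ext:peerReviewed>false</ext:peerReviewed></pub:publication>
</result>')
ns <- xml_ns(xF)

# One node per publication; every lookup below is relative to these nodes.
pubs <- xml_find_all(xF, "//pub:publication", ns)

# The leading "." keeps each search inside its own publication;
# a publication without the element yields NA instead of being skipped.
Titles   <- xml_text(xml_find_first(pubs, "./pub:title", ns))
Peer.Rev <- xml_text(xml_find_first(pubs, ".//ext:peerReviewed", ns))

data.frame(Titles, Peer.Rev)
```

Note the relative paths: an XPath starting with "//" searches the whole document from every node, so it would not stay within each publication.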
Hope that helps.
Upvotes: 5