J.M.S.

Reputation: 105

R & xml2: Replace missing XML elements with NA

I'm using the xml2 package to pull publication data out of an online XML document, like this one, with this code:

xF <- read_xml(target, encoding = "UTF-8")   ## target = above link

No problems getting items that exist for each publication node.

Titles <- xml_text(xml_find_all(xF, "//publication-base_uk:title", xml_ns(xF)))
Pub.Lang <- xml_text(xml_find_all(xF, "//publication-base_uk:language/core:term/core:localizedString", xml_ns(xF)))
## etc...

However, I'm stumped as to how to get items that don't always have an entry, like the peer review tag.

Peer.Rev <- xml_text(xml_find_all(xF, "//extensions-core:peerReviewed", xml_ns(xF)))

This returns a value for every publication that has a peerReviewed element, but because some publications are missing it, the result is shorter than the other vectors and the values no longer line up with their publications. Is there a way to put an NA (or anything, really) in place of the missing text values?

Thanks in advance.

Upvotes: 5

Views: 1228

Answers (1)

RobertMyles

Reputation: 2832

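Using xml2::xml_find_first() should get you what you want. Applied to a set of nodes, it returns exactly one result per node, with a missing node where nothing matches, and xml_text() turns those missing nodes into NA.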

Example:

Let's say we want the blog post categories from this RSS feed: https://eagereyes.org/feed. Some of these posts have one category, some have more than one. Searching for the first category works just fine:

library(magrittr)  # provides the %>% pipe used below

feed <- "https://eagereyes.org/feed"
doc <- httr::GET(feed) %>% xml2::read_xml()
channel <- xml2::xml_find_all(doc, "channel")
site <- xml2::xml_find_all(channel, "item")   # one node per blog post

categories <- tibble::tibble(
    category1 = xml2::xml_text(xml2::xml_find_all(site, "category[1]"))
  )

> categories
# A tibble: 10 x 1
       category1
           <chr>
 1        Papers
 2     Blog 2017
 3         Links
 4     Blog 2017
 5     Blog 2017
 6          Talk
 7 ISOTYPE Books
 8    Techniques
 9        Basics
10     Blog 2017

But asking for a second category as well does not, because not every post has one:

categories <- tibble::tibble(
    category1 = xml2::xml_text(xml2::xml_find_all(site, "category[1]")),
    category2 = xml2::xml_text(xml2::xml_find_all(site, "category[2]"))
  )

Error: Column `category2` must be length 1 or 10, not 3
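
The mismatch comes from xml_find_all() keeping only the nodes that match, so posts without a second category simply contribute nothing. Here is a minimal offline sketch of that behaviour; the three-item feed is made up for illustration and is not the real eagereyes feed:

```r
library(xml2)

# Hypothetical three-item feed; only the first item has a second category.
toy <- read_xml(
  "<rss><channel>
     <item><category>Papers</category><category>paper</category></item>
     <item><category>Links</category></item>
     <item><category>Talk</category></item>
   </channel></rss>"
)
items <- xml_find_all(toy, "channel/item")

# xml_find_all() returns only the matches, so unmatched items vanish
# and the two vectors end up with different lengths.
first_cat  <- xml_text(xml_find_all(items, "category[1]"))
second_cat <- xml_text(xml_find_all(items, "category[2]"))

length(first_cat)   # one entry per item
length(second_cat)  # only the items that actually have a second category
```

Since the shorter vector no longer says which item each value came from, it cannot be safely lined up with the others.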

xml_find_first to the rescue:

categories <- tibble::tibble(
    category1 = xml2::xml_text(xml2::xml_find_first(site, "category[1]")),
    category2 = xml2::xml_text(xml2::xml_find_first(site, "category[2]"))
  )
> categories
# A tibble: 10 x 2
       category1  category2
           <chr>      <chr>
 1        Papers      paper
 2     Blog 2017 conference
 3         Links       <NA>
 4     Blog 2017       <NA>
 5     Blog 2017       <NA>
 6          Talk       <NA>
 7 ISOTYPE Books    isotype
 8    Techniques       <NA>
 9        Basics       <NA>
10     Blog 2017       <NA>
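
The same idea should carry over to a namespaced document like the one in the question: first collect one node per publication, then call xml_find_first() on that set with a relative path. The sketch below is only an illustration under assumptions: the inline XML, the `ext` prefix, its namespace URI, and the `pub` element name are all invented, so the paths would need adapting to the real document.

```r
library(xml2)

# Hypothetical stand-in for the publication document; the namespace URI and
# the <pub> / <ext:peerReviewed> names are invented for illustration.
toy <- read_xml(
  '<result xmlns:ext="http://example.org/ext">
     <pub><title>A</title><ext:peerReviewed>true</ext:peerReviewed></pub>
     <pub><title>B</title></pub>
     <pub><title>C</title><ext:peerReviewed>false</ext:peerReviewed></pub>
   </result>'
)
ns   <- xml_ns(toy)
pubs <- xml_find_all(toy, "//pub")

# One lookup per publication node; a publication without the element
# yields a missing node, which xml_text() reports as NA.
peer_rev <- xml_text(xml_find_first(pubs, "./ext:peerReviewed", ns))
titles   <- xml_text(xml_find_first(pubs, "./title"))

data.frame(title = titles, peer_reviewed = peer_rev)
```

Because every column is built from the same `pubs` node set, all vectors have the same length and row i always refers to publication i.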

Hope that helps.

Upvotes: 5
