R - Replace xml tags with whitespace using rvest

Question

I'm reading an XML file into R using xml2 and rvest. The XML has the following structure (headers not included). I want to extract all the text inside each paragraph node (selected as w:p in the code below), but first I want to convert every line-break tag to whitespace.

First bit of textThank you!

When I run the following code (with completely legitimate XML):

 xml = '   
   
   
     
   Example .
   docx
    file
   
   This is an example .
   
   docx
    file included with the ‘  
   
   readOffice
    
   ’ package to demonstrate functionality.
   
   There is nothing exciting in this file!
   Thank you!
   
   
   
   
   
   
   
   '


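    library(magrittr)  # provides the %>% pipe
    # The colon in the CSS selector is escaped; R needs the backslash doubled
    xml2::read_xml(xml) %>% 
      rvest::xml_nodes('w\\:p') %>% 
      xml2::xml_text()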

The results are:

[1] "Example .docx file"                                                                                                 
[2] "This is an example .docx file included with the \u0091readOffice\u0092 package to demonstrate functionality."       
[3] "There is nothing exciting in this file!Thank you!"

But the line break has simply disappeared, leaving no space between the final exclamation mark and the word "Thank".

In the actual application I'm reading the XML from a file rather than a string (using the read_xml function), so I'm not looking for a simple gsub solution, although maybe gsub is the only fix. What I'm wondering is: how can I use rvest and xml2 to convert specific tags into whitespace?
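One way to approach this without gsub is to edit the parsed document before extracting the text. The following is only a minimal sketch under stated assumptions: it assumes the line-break tag is w:br and that the text sits in w:t elements, as in WordprocessingML. The sample string and the namespace URI are illustrative stand-ins, not the actual file.

```r
library(xml2)

# Illustrative sample only: the element names (w:p, w:r, w:t, w:br) and the
# namespace URI are assumptions modelled on WordprocessingML.
sample_xml <- paste0(
  '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">',
  '<w:body><w:p><w:r>',
  '<w:t>There is nothing exciting in this file!</w:t>',
  '<w:br/>',
  '<w:t>Thank you!</w:t>',
  '</w:r></w:p></w:body></w:document>'
)

doc <- read_xml(sample_xml)

# Give every (empty) w:br element a single space as its text content.
# This changes the parsed document in memory, not the file on disk.
for (br in xml_find_all(doc, ".//w:br")) {
  xml_text(br) <- " "
}

# Paragraph text now includes the space contributed by each w:br.
spaced <- xml_text(xml_find_all(doc, ".//w:p"))
print(spaced)
```

Because the change happens on the parsed tree, the same steps should apply whether read_xml() was given a string or a file path.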

UPDATE

Another answer suggested using the XPath normalize-space() function:

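paragraphs = xml2::read_xml(xml) %>% 
  rvest::xml_nodes('w\\:p')  # escaped colon; the backslash is doubled for R
# Join the non-empty text nodes of each paragraph with a single space
purrr::map(paragraphs,function(x){
  paste(xml2::xml_text(rvest::xml_nodes(x,xpath=".//text()[normalize-space()]")),collapse=" ")
})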

This doesn't produce the desired result, however, because the text is split on every tag, not only on line breaks, so extra spaces are introduced. Note that in the first two elements there's a space inside '.docx', and in the second there are spaces around 'readOffice' inside the quotes.

[[1]]
[1] "Example . docx  file"

[[2]]
[1] "This is an example . docx  file included with the ‘ readOffice ’ package to demonstrate functionality."

[[3]]
[1] "There is nothing exciting in this file, but if you’re reading it, it means you installed my package! Thank you!"

I know the spaces are due to my use of collapse=" ", but if I use collapse="" instead, the results are the same as those from the original code.
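A possible middle ground between the two collapse settings is to leave the document untouched, select the text elements and the line-break elements together in document order, and map only the breaks to a space. This is a sketch rather than a confirmed fix: the names w:t and w:br, the sample string, the namespace URI, and the helper paragraph_text() are all assumptions made for illustration.

```r
library(xml2)

# Illustrative sample only: element names and namespace URI are assumptions
# modelled on WordprocessingML, not copied from the real file.
sample_xml <- paste0(
  '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">',
  '<w:body>',
  '<w:p><w:r><w:t>Example .</w:t></w:r><w:r><w:t>docx</w:t></w:r>',
  '<w:r><w:t xml:space="preserve"> file</w:t></w:r></w:p>',
  '<w:p><w:r><w:t>There is nothing exciting in this file!</w:t>',
  '<w:br/><w:t>Thank you!</w:t></w:r></w:p>',
  '</w:body></w:document>'
)

doc <- read_xml(sample_xml)
paragraphs <- xml_find_all(doc, ".//w:p")

# Hypothetical helper: build one string per paragraph.
paragraph_text <- function(p) {
  # Text elements and break elements, returned in document order
  parts <- xml_find_all(p, ".//w:t | .//w:br")
  # A break contributes one space; a text element contributes its text
  pieces <- ifelse(xml_name(parts) == "br", " ", xml_text(parts))
  # No separator, so runs split across tags are rejoined without extra spaces
  paste(pieces, collapse = "")
}

result <- vapply(paragraphs, paragraph_text, character(1))
print(result)
```

With collapse = "" the split runs such as ".docx" stay joined, and the only added whitespace comes from the break elements.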
