Mark

Reputation: 4537

R - Replace xml tags with whitespace using rvest

I'm reading an XML file into R using xml2 and rvest. The XML has the following structure (headers not included). I want to extract all the text inside each <w:p></w:p> element, but first I want to convert every <w:br/> to whitespace.

<w:p><w:r><w:t>First bit of text</w:t></w:r><w:r><w:br/><w:t>Thank you!</w:t></w:r></w:p>

When I run the following code on a complete, valid XML document:

 xml = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>   
   <w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:cx="http://schemas.microsoft.com/office/drawing/2014/chartex" xmlns:cx1="http://schemas.microsoft.com/office/drawing/2015/9/8/chartex" xmlns:cx2="http://schemas.microsoft.com/office/drawing/2015/10/21/chartex" xmlns:cx3="http://schemas.microsoft.com/office/drawing/2016/5/9/chartex" xmlns:cx4="http://schemas.microsoft.com/office/drawing/2016/5/10/chartex" xmlns:cx5="http://schemas.microsoft.com/office/drawing/2016/5/11/chartex" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:w16se="http://schemas.microsoft.com/office/word/2015/wordml/symex" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 w15 w16se wp14">
   <w:body><w:p w:rsidR="00C87F35" w:rsidRDefault="008836BC" w:rsidP="008836BC"><w:pPr>
   <w:pStyle w:val="Heading1"/></w:pPr>  
   <w:r><w:t>Example .</w:t></w:r>
   <w:proofErr w:type="spellStart"/><w:r><w:t>docx</w:t></w:r><w:proofErr w:type="spellEnd"/>
   <w:r><w:t xml:space="preserve"> file</w:t></w:r></w:p>
   <w:p w:rsidR="008836BC" w:rsidRDefault="008836BC" w:rsidP="008836BC">
   <w:r><w:t>This is an example .</w:t></w:r>
   <w:proofErr w:type="spellStart"/>
   <w:r><w:t>docx</w:t></w:r><w:proofErr w:type="spellEnd"/>
   <w:r><w:t xml:space="preserve"> file included with the ‘</w:t></w:r>  
   <w:proofErr w:type="spellStart"/><w:r>
   <w:t>readOffice</w:t></w:r>
   <w:proofErr w:type="spellEnd"/> 
   <w:r><w:t>’ package to demonstrate functionality.</w:t></w:r></w:p>
   <w:p w:rsidR="008836BC" w:rsidRPr="008836BC" w:rsidRDefault="008836BC" w:rsidP="008836BC">
   <w:r><w:t>There is nothing exciting in this file!</w:t></w:r>
   <w:r><w:br/><w:t>Thank you!</w:t></w:r>
   <w:bookmarkStart w:id="0" w:name="_GoBack"/>
   <w:bookmarkEnd w:id="0"/></w:p>
   <w:sectPr w:rsidR="008836BC" w:rsidRPr="008836BC">
   <w:pgSz w:w="12240" w:h="15840"/>
   <w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="720" w:footer="720" w:gutter="0"/>
   <w:cols w:space="720"/>
   <w:docGrid w:linePitch="360"/></w:sectPr>
   </w:body></w:document>'


    library(magrittr)  # provides the %>% pipe
    xml2::read_xml(xml) %>% 
      rvest::xml_nodes('w\\:p') %>% 
      xml2::xml_text()

The results are:

[1] "Example .docx file"                                                                                                 
[2] "This is an example .docx file included with the \u0091readOffice\u0092 package to demonstrate functionality."       
[3] "There is nothing exciting in this file!Thank you!"

but the line break <w:br/> has just disappeared leaving no space between the final exclamation mark and the word Thank.

In the actual application, I'm reading the XML from a file with read_xml rather than from a string, so I'd prefer not to fall back on gsub, although maybe that is the only fix. What I'm wondering is: how can I use rvest and xml2 to convert specific tags into whitespace?

UPDATE

Another answer suggested using the XPath normalize-space() function:

paragraphs = xml2::read_xml(xml) %>% 
  rvest::xml_nodes('w\\:p')
purrr::map(paragraphs,function(x){
  paste(xml2::xml_text(rvest::xml_nodes(x,xpath=".//text()[normalize-space()]")),collapse=" ")
})

This doesn't produce the desired result, however. The XPath returns every text node separately, including those that sit in adjacent <w:r> and <w:t> elements, so pasting them together with a space introduces extra spaces. Note the space inside '.docx' in the first two elements and the spaces around 'readOffice' in the second.

[[1]]
[1] "Example . docx  file"

[[2]]
[1] "This is an example . docx  file included with the ‘ readOffice ’ package to demonstrate functionality."

[[3]]
[1] "There is nothing exciting in this file, but if you’re reading it, it means you installed my package! Thank you!"

I know the spaces come from my use of collapse=" ", but with collapse="" the results are the same as with the original code.

Upvotes: 0

Views: 787

Answers (1)

GGamba

Reputation: 13680

This may not be needed anymore, but you can replace each w:br node's (empty) text with a newline character and then extract the whole text:

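library(xml2)   # read_xml(), xml_text()
library(rvest)
library(purrr)

read_xml(xml) %>% 
    xml_nodes('w\\:p') %>% 
    map(~{
        # Set the text of every w:br in this paragraph to a newline.
        # xml2 nodes are modified in place, so .x sees the change.
        xml_nodes(.x, 'w\\:br') %>% `xml_text<-`('\n')

        # Extract the paragraph text, now including the newline.
        xml_text(.x)
    }) -> r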

cat(r[[3]])
#> There is nothing exciting in this file!
#> Thank you!
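
Since the question asks for whitespace rather than a newline, the same idea can be sketched with xml2 alone, writing a single space into each w:br and selecting nodes with namespace-aware XPath instead of the escaped CSS selector. This is only a minimal sketch: the shortened inline document and the names doc, breaks and paragraphs are made up for illustration, and it assumes the w prefix is declared in the document so that xml_find_all() can resolve it.

```r
library(xml2)

# Hypothetical cut-down document; only the w namespace is declared.
doc <- read_xml(paste0(
  '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">',
  '<w:body><w:p><w:r><w:t>First bit of text</w:t></w:r>',
  '<w:r><w:br/><w:t>Thank you!</w:t></w:r></w:p></w:body></w:document>'
))

# Find every w:br in the document and give it a single space as its text.
# xml2 nodes are modified in place, so doc itself changes.
breaks <- xml_find_all(doc, "//w:br")
if (length(breaks) > 0) {
  xml_text(breaks) <- " "
}

# Paragraph text now contains a space where the break was.
paragraphs <- xml_text(xml_find_all(doc, "//w:p"))
print(paragraphs)
#> [1] "First bit of text Thank you!"
```

Doing the replacement once on the whole document, rather than per paragraph, avoids a separate pass over each w:p; the length check simply skips the assignment when a document has no breaks.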

Upvotes: 1
