Gene Burinsky

Reputation: 10233

HTML character entity replacement in R

I have a large set of HTML files that contain text from a magazine in span nodes. My PDF-to-HTML converter inserted the character entity &nbsp; throughout the HTML. The problem is that in R, I use the xmlValue function (from the XML package) to extract the text, but wherever there was an &nbsp; the space between the words is eliminated. For example:

<span class="ft6">kids,&nbsp;and kids in your community,&nbsp;in                                   DIY&nbsp;projects.&nbsp;</span>

will come out of the xmlValue function as:

"kids,and kids in your community,in DIYprojects."

I was thinking that the easiest way to resolve this would be to find all &nbsp; before running the span nodes through xmlValue, and replace them with a " " (space). How would I approach that?

Upvotes: 1

Views: 1396

Answers (1)

SlowLearner

Reputation: 7997

I have rewritten the answer to reflect the original poster's problem of not being able to get the text from xmlValue. There are probably several ways to tackle this, but one is to open the HTML files directly, replace the entities, and write the results back out. Generally, tackling XML/HTML with regexes is A Bad Idea, but here we have a straightforward problem of unwanted non-breaking spaces, so it's likely not too much of an issue. The following code is an example of how to create a list of matching files and perform a gsub on the contents. It should be easy to modify or expand as needed.

setwd("c:/test/")
# Create 'html' file to use with test
txt <- "<span class=ft6>kids,&nbsp;and kids in your community,&nbsp;in                                   DIY&nbsp;projects.&nbsp;</span>
<span class=ft6>kids,&nbsp;and kids in your community,&nbsp;in                                   DIY&nbsp;projects.&nbsp;</span>
<span class=ft6>kids,&nbsp;and kids in your community,&nbsp;in                                   DIY&nbsp;projects.&nbsp;</span>"
writeLines(txt, "file1.html")

# Now list the files to process - in this case only one.
# The pattern is a regular expression, so escape the dot and anchor the end.
html.files <- list.files(pattern = "\\.html$")
html.files

# Loop through the list of files
retval <- lapply(html.files, function(x) {
  in.lines <- readLines(x, n = -1)
  # Replace each non-breaking space entity with an ordinary space
  # (fixed = TRUE treats "&nbsp;" as a literal string, not a regex)
  out.lines <- gsub("&nbsp;", " ", in.lines, fixed = TRUE)
  # Write the corrected lines to a new file prefixed with "new_"
  writeLines(out.lines, paste("new_", x, sep = ""))
})
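
If writing new files to disk is not wanted, the same replacement can be done on the text in memory before it is parsed. The following is only a minimal sketch using base R; the helper name replace_nbsp is hypothetical, and covering the decimal form &#160; as well is an assumption, since the question only mentions &nbsp;.

```r
# Hypothetical helper (not part of the original answer): replace the named
# and decimal entity forms of a non-breaking space with an ordinary space.
replace_nbsp <- function(x) {
  gsub("&nbsp;|&#160;", " ", x)
}

# Sample input modelled on the span from the question
html <- "<span class=\"ft6\">kids,&nbsp;and kids in your community,&nbsp;in DIY&nbsp;projects.&nbsp;</span>"

cleaned <- replace_nbsp(html)
cleaned
```

The cleaned string could then be handed to the parser as text rather than as a file, for example with htmlParse(cleaned, asText = TRUE) from the XML package, assuming that package is installed.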

Upvotes: 1
