Reputation: 10233
I have a large set of HTML files that contain text from a magazine in span nodes. My PDF to HTML converter inserted the character entity &nbsp; throughout the HTML. The problem is that in R, I use the xmlValue function (in the XML package) to extract the text, but wherever there was a &nbsp; the space between the words is eliminated. For example:
<span class="ft6">kids, and kids in your community, in DIY projects. </span>
will come out of the xmlValue function as:
"kids,and kids in your community,in DIYprojects."
I was thinking that the easiest way to resolve this would be to find all the &nbsp; entities before running the span nodes through xmlValue, and replace them with a " " (space). How would I approach that?
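Roughly, I imagine something like the sketch below (just a rough idea, using readLines/gsub on the raw file before parsing with the XML package's htmlParse and xpathSApply; the file name is made up), but I'm not sure this is the right way to go:
library(XML)
# Hypothetical input file; substitute one of the converter's HTML files
raw <- readLines("page1.html", warn = FALSE)
# Replace the literal entity with a plain space before parsing
fixed <- gsub("&nbsp;", " ", raw, fixed = TRUE)
doc <- htmlParse(paste(fixed, collapse = "\n"), asText = TRUE)
# Extract the span text as before
xpathSApply(doc, "//span[@class='ft6']", xmlValue)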
Upvotes: 1
Views: 1396
Reputation: 7997
I have re-written the answer to reflect the original poster's problem of not being able to get the text correctly out of xmlValue. There are probably different ways to tackle this, but one way is just to directly open/replace/write the HTML files themselves. Generally, tackling XML/HTML with regexes is A Bad Idea, but in this case we have a straightforward problem of unwanted non-breaking spaces, so it's likely not too much of an issue. The following code is an example of how to create a list of matching files and perform a gsub on the contents. It should be easy to modify or expand as needed.
setwd("c:/test/")
# Create 'html' file to use with test
txt <- "<span class=ft6>kids, and kids in your community, in DIY projects. </span>
<span class=ft6>kids, and kids in your community, in DIY projects. </span>
<span class=ft6>kids, and kids in your community, in DIY projects. </span>"
writeLines(txt, "file1.html")
# Now read files - in this case only one
html.files <- list.files(pattern = "\\.html$")
html.files
# Loop through the list of files
retval <- lapply(html.files, function(x) {
in.lines <- readLines(x, n = -1)
# Replace non-breaking space with space
out.lines <- gsub("&nbsp;", " ", in.lines)
# Write out the corrected lines to a new file
writeLines(out.lines, paste("new_", x, sep = ""))
})
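As a quick check (assuming the XML package is installed), you can then parse one of the rewritten files and confirm that xmlValue now keeps the spaces:
library(XML)
doc <- htmlParse("new_file1.html")
# Each span should now come back with its spaces intact,
# e.g. "kids, and kids in your community, in DIY projects. "
xpathSApply(doc, "//span", xmlValue)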
Upvotes: 1