user2880936

Reputation: 23

R: Parsing group of html files with loop

The following code works for individual .html files:

doc <- htmlParse("New folder/1-4.html")
plain.text <- xpathSApply(doc, "//td", xmlValue)
plain.text <- gsub("\n", "", plain.text)
gregexpr("firstThing", plain.text)
firstThing <- substring(plain.text[9], 41, 50)
gregexpr("secondThing", plain.text)
secondThing <- substring(plain.text[7], 1, 550)

But the following loop does not and gives me the error:

XML content does not seem to be XML

file.names <-  dir(path = "New folder")

for(i in 1:length(file.names)){
doc <- htmlParse(file.names[i])
plain.text <- xpathSApply(doc, "//td", xmlValue)
gsub("\n", "", plain.text)
firstThing[i] <- substring(plain.text[9], 41, 50)
secondThing[i] <- substring(plain.text[7], 1, 550)
  }

I'm simply trying to extract the information (as I've been able to do in the first batch of code), and create a vector of information.

Any ideas on how to resolve this issue?

Upvotes: 2

Views: 539

Answers (1)

Konrad Rudolph

Reputation: 545598

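Two things. First, your paths are wrong: by default, dir returns bare file names without the folder, so htmlParse looks for them relative to the working directory instead of inside "New folder". To fix this, use: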

filenames = dir(path = "New folder", full.names = TRUE)
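
To see the difference for yourself, here is a minimal sketch using only base R. The temporary folder and the file names in it are made up for illustration; nothing here comes from your actual data.

```r
# Create a throwaway folder with two empty files (hypothetical names).
folder <- file.path(tempdir(), "New folder")
dir.create(folder, showWarnings = FALSE)
file.create(file.path(folder, c("1-4.html", "1-5.html")))

bare_names <- dir(path = folder)                     # file names only
full_paths <- dir(path = folder, full.names = TRUE)  # folder prepended

# Bare names are resolved against the working directory, so they are
# normally not found there; the full paths point at the real files.
file.exists(bare_names)
file.exists(full_paths)
```

If the folder may also contain non-HTML files, dir accepts a pattern argument (for example pattern = "\\.html$") to restrict the listing.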

Second, rather than filling two variables inside a for loop, it is better to build structured data with a list-returning function such as lapply:

result = lapply(filenames, function (filename) {
    doc = htmlParse(filename)
    plain_text = xpathSApply(doc, "//td", xmlValue)
    c(first = substring(plain_text[9], 41, 50),
      second = substring(plain_text[7], 1, 550))
})

Now result is a list with one element per file, where each element is a character vector with the names first and second.
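
If you would rather have one row per file, a list of this shape can be stacked with base R. The sketch below uses a hand-written stand-in for result (the values are invented), so it runs without the XML package or your files.

```r
# Stand-in for the output of the lapply call: one named character
# vector per file (values are made up for illustration).
result <- list(
  c(first = "a1", second = "b1"),
  c(first = "a2", second = "b2")
)

# Stack the vectors into a character matrix with columns first/second.
result_matrix <- do.call(rbind, result)

# Or turn that into a data frame with one row per file.
result_df <- as.data.frame(result_matrix, stringsAsFactors = FALSE)

# All "first" values as a single character vector.
result_df$first
```

This gives you the plain vector of values the question asked for, while keeping both pieces of information side by side.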

A few other remarks:

  • Be wary of dots in variable names: S3 dispatch uses the generic.class naming pattern (for example, summary.lm is the summary method for class lm) to find the method for a class. Using dots for anything else in names causes confusion and should be avoided.

  • The gsub statement in your loop has no effect, because its result is never assigned back to plain.text.
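
On the gsub remark: R functions return a modified copy rather than changing their argument in place. A small sketch with an invented two-element vector shows the difference between calling gsub and assigning its result:

```r
plain_text <- c("line\none", "line\ntwo")

# Calling gsub without assignment computes a cleaned copy and discards it.
gsub("\n", "", plain_text)
unchanged <- plain_text   # still contains the newlines

# Assigning the result back is what actually keeps the cleaned text.
plain_text <- gsub("\n", "", plain_text)
```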

Upvotes: 0
