ageil

Reputation: 171

R: Subscript out of bounds when using tm function Corpus on LexisNexis-data

I'm trying to create a corpus of articles from LexisNexis with the tm package. The articles were exported from LexisNexis as .html and are read into R with the tm.plugin.lexisnexis package like so:

> library("tm")
> library("tm.plugin.lexisnexis")
> src <- LexisNexisSource("~/Desktop/lexisnexis.html")

Following the instructions in the tm.plugin.lexisnexis documentation, I then create a corpus with tm's Corpus function, like so:

> data <- Corpus(src, readerControl = list(language = NA))
Error in getNodeSet(tree, "//div[@class = 'c3']/p[@class = 'c1']/span[@class = 'c4']")[[1]] : 
  subscript out of bounds

What does this error mean, and how do I fix it?

Example html-data: link

Upvotes: 0

Views: 412

Answers (1)

Milan Bouchet-Valat

Reputation: 524

I'm the author of the package. It's currently broken as the format used by LexisNexis is undocumented. I'll try to fix it, but if anybody proposes a patch, it will happen sooner. :-)
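
As for what the error message itself means, here is a minimal sketch, assuming the XPath query shown in the traceback matched no nodes in the exported file. In that case getNodeSet would presumably return an empty result, and taking `[[1]]` of something empty fails in exactly this way. A plain empty list stands in for that result here, and the names `nodes`, `msg` and `first_or_na` are made up for illustration:

```r
# Stand-in for an XPath query that matched nothing (assumption: the real
# call in the traceback returned an empty result for this file).
nodes <- list()
length(nodes)  # 0

# Taking the first element of an empty list raises the error from the question.
msg <- tryCatch(nodes[[1]], error = function(e) conditionMessage(e))
msg  # "subscript out of bounds"

# Hypothetical guard: check the length before indexing.
first_or_na <- if (length(nodes) > 0) nodes[[1]] else NA
first_or_na  # NA
```

So the error is most likely not caused by the call in the question; it indicates that the parser expects a node the file does not contain.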

Upvotes: 1
