Reputation: 171
I'm trying to create a corpus of articles from LexisNexis with the tm package.
The articles were exported from LexisNexis as .html and are read into R with the tm.plugin.lexisnexis package, like so:
> library("tm")
> library("tm.plugin.lexisnexis")
> src <- LexisNexisSource("~/Desktop/lexisnexis.html")
Following the instructions in the tm.plugin.lexisnexis documentation, I then create a corpus using the tm package, like so:
> data <- Corpus(src, readerControl = list(language = NA))
Error in getNodeSet(tree, "//div[@class = 'c3']/p[@class = 'c1']/span[@class = 'c4']")[[1]] :
subscript out of bounds
What does this error mean, and how do I fix it?
Example html-data: link
Upvotes: 0
Views: 412
Reputation: 524
I'm the author of the package. It's currently broken as the format used by LexisNexis is undocumented. I'll try to fix it, but if anybody proposes a patch, it will happen sooner. :-)
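As for what the error message itself means, here is a minimal sketch, assuming only the XML package (which provides the getNodeSet call named in the error). The inline HTML and its class names (x1, x2, x3) are made up for illustration and are not taken from a real LexisNexis export. When the XPath from the error matches nothing, getNodeSet returns an empty list, and asking for its first element with [[1]] fails:

```r
library(XML)

# Hypothetical stand-in for an export whose markup does not use the
# class names the reader expects (x1/x2/x3 are invented for this sketch).
html <- paste0(
  '<html><body><div class="x1"><p class="x2">',
  '<span class="x3">Some headline</span></p></div></body></html>'
)
tree <- htmlParse(html, asText = TRUE)

# The XPath expression copied from the error message in the question.
xpath <- "//div[@class = 'c3']/p[@class = 'c1']/span[@class = 'c4']"
nodes <- getNodeSet(tree, xpath)

# Number of nodes the expression matched in this document.
n_matches <- length(nodes)

# Taking the first element of an empty result is what raises the error;
# tryCatch captures the message instead of stopping the script.
msg <- tryCatch({
  nodes[[1]]
  "no error"
}, error = function(e) conditionMessage(e))

print(n_matches)
print(msg)
```

Running the same XPath against your own file (htmlParse("~/Desktop/lexisnexis.html") followed by getNodeSet) and checking the length of the result should show whether the export still contains the nodes the reader looks for.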
Upvotes: 1