Reputation: 10199
I am doing text mining on the following data, but I get the following error at the end:
Error in `[.simple_triplet_matrix`(dtm, 1:10, 1:10) :
subscript out of bounds
Can you help me fix this error?
library(rvest)
library(tm)
library(SnowballC)
wiki_url <- read_html("https://wiki.socr.umich.edu/index.php/SOCR_Data_2011_US_JobsRanking")
html_nodes(wiki_url, "#content")
job <- html_table(html_nodes(wiki_url, "table")[[1]])
head(job)
#'
#' ## Step 1: make a VCorpus object
#'
#'
jobCorpus<-VCorpus(VectorSource(job[, 10]))
#'
#'
#' ## Step 2: clean the VCorpus object
#'
#'
jobCorpus<-tm_map(jobCorpus, tolower)
for(j in seq(jobCorpus)){
jobCorpus[[j]] <- gsub("_", " ", jobCorpus[[j]])
}
#
#
jobCorpus<-tm_map(jobCorpus, removeWords, stopwords("english"))
jobCorpus<-tm_map(jobCorpus, removePunctuation)
jobCorpus<-tm_map(jobCorpus, stripWhitespace)
jobCorpus<-tm_map(jobCorpus, PlainTextDocument)
jobCorpus<-tm_map(jobCorpus, stemDocument)
#
#
# build document-term matrix
#
# Term Document Matrix (TDM) objects (`tm::DocumentTermMatrix`) contain a sparse term-document matrix or document-term matrix and attribute weights of the matrix.
#
# First make sure that we got a clean VCorpus object
#
jobCorpus[[1]]$content
#
#
# Then we can start to build the DTM and reassign labels to the `Docs`.
dtm<-DocumentTermMatrix(jobCorpus)
dtm
dtm$dimnames$Docs<-as.character(1:200)
inspect(dtm[1:10, 1:10]) ###<-- error happens from here
#' Let's subset the `dtm` into top 30 jobs and bottom 100 jobs.
dtm_top30<-dtm[1:30, ]
dtm_bot100<-dtm[101:200, ]
Upvotes: 0
Views: 360
Reputation: 1
As an alternative to the answer offered by @phiver: after "head(job)", convert the job descriptions to a list, so that each description becomes its own document....
jobs <- as.list(job$Description)
jobCorpus <- VCorpus(VectorSource(jobs))
....
Upvotes: 0
Reputation: 23608
Two issues. First, using tolower in this way strips the corpus of too much information: applied directly through tm_map, it returns plain character data instead of documents, so wrap it in content_transformer. Second, you should use DataframeSource instead of VectorSource. With VectorSource as you use it, you load only 1 document with 200 lines, instead of 200 documents with one line each.
The code below works. I start from the point where you have created the job data.frame:
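The difference between the two inputs can be checked on a small example. This is only a sketch: job_demo is a hypothetical stand-in for the scraped table, and drop = FALSE imitates a tibble, which keeps a single selected column as a data frame (an assumption about what html_table returned in the question).

```r
library(tm)

# Hypothetical stand-in for the scraped job table: three rows, one text column
job_demo <- data.frame(
  Description = c("design software systems", "treat sick patients", "drive heavy trucks"),
  stringsAsFactors = FALSE
)

# A one-column data frame is a list of length one, so VectorSource sees a
# single element and all rows end up inside one document
corpus_from_df <- VCorpus(VectorSource(job_demo[, 1, drop = FALSE]))
n_from_df <- length(corpus_from_df)

# A character vector has one element per row, so each row becomes a document
corpus_from_vec <- VCorpus(VectorSource(job_demo$Description))
n_from_vec <- length(corpus_from_vec)

print(c(from_data_frame = n_from_df, from_vector = n_from_vec))
```

If the first count is 1, relabelling Docs to 1:200 only changes the labels; the matrix still has a single row, which is why dtm[1:10, 1:10] is out of bounds.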
# DataframeSource needs the columns doc_id and text; you could also rename two columns in job.
# Instead of doc_<index> as the doc_id, you could also use the job title column.
job_for_corpus <- data.frame(doc_id = paste0("doc_", job$Index),
                             text = job$Description, stringsAsFactors = FALSE)
# no need for loop, just use gsub on data.frame column
job_for_corpus$text <- gsub("_", " ", job_for_corpus$text)
# create corpus
jobCorpus <- VCorpus(DataframeSource(job_for_corpus))
# clean text
jobCorpus <- tm_map(jobCorpus, content_transformer(tolower))
jobCorpus <- tm_map(jobCorpus, removeWords, stopwords("english"))
jobCorpus <- tm_map(jobCorpus, removePunctuation)
jobCorpus <- tm_map(jobCorpus, stripWhitespace)
jobCorpus <- tm_map(jobCorpus, stemDocument)
jobCorpus[[1]]$content
[1] "research design develop maintain softwar system along hardwar develop medic scientif industri purpos"
# create document term matrix
dtm <- DocumentTermMatrix(jobCorpus)
inspect(dtm[1:10, 1:10])
<<DocumentTermMatrix (documents: 10, terms: 10)>>
Non-/sparse entries: 2/98
Sparsity           : 98%
Maximal term length: 7
Weighting          : term frequency (tf)
Sample             :
        Terms
Docs     16wheel abnorm access accid accord account accur achiev act activ
  doc_1        0      0      0     0      0       0     0      0   0     0
  doc_10       0      0      0     0      0       0     0      0   0     0
  doc_2        0      0      0     0      0       0     0      0   0     0
  doc_3        0      0      0     1      0       0     0      0   0     0
  doc_4        0      0      0     0      0       0     0      0   0     0
  doc_5        0      0      0     0      0       0     0      0   0     0
  doc_6        0      0      0     0      0       0     0      0   0     0
  doc_7        0      0      0     0      0       0     0      0   0     0
  doc_8        0      0      0     0      1       0     0      0   0     0
  doc_9        0      0      0     0      0       0     0      0   0     0
# rest of your code
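Before subsetting, it may help to confirm that the matrix really has as many documents as the index range assumes. The sketch below uses a hypothetical five-document toy corpus (toy_docs, toy_dtm and the clamped ranges are illustrative names, not part of the original code) and tm's nDocs and nTerms accessors.

```r
library(tm)

# Hypothetical toy data standing in for the 200 job descriptions
toy_docs <- data.frame(
  doc_id = paste0("doc_", 1:5),
  text = c("design software systems", "treat sick patients",
           "drive heavy trucks", "teach school children",
           "repair broken engines"),
  stringsAsFactors = FALSE
)
toy_dtm <- DocumentTermMatrix(VCorpus(DataframeSource(toy_docs)))

# Check the dimensions before indexing
n_docs <- nDocs(toy_dtm)
n_terms <- nTerms(toy_dtm)

# Clamp the requested ranges so the subscripts cannot go out of bounds
rows <- seq_len(min(10, n_docs))
cols <- seq_len(min(10, n_terms))
sub_dtm <- toy_dtm[rows, cols]
print(dim(sub_dtm))
```

With the real data, nDocs(dtm) should report 200 before dtm[101:200, ] is attempted; a value of 1 points back to how the corpus was built.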
Upvotes: 1