Reputation: 10199

subscript out of bounds error in document-term matrix

I am doing text mining on the following data, but I get the following error at the end:

Error in `[.simple_triplet_matrix`(dtm, 1:10, 1:10) : 
  subscript out of bounds

Can you help me address this error?

library(rvest)  
library(tm)
library(SnowballC)
wiki_url <- read_html("https://wiki.socr.umich.edu/index.php/SOCR_Data_2011_US_JobsRanking")    
html_nodes(wiki_url, "#content")    
job <- html_table(html_nodes(wiki_url, "table")[[1]])   
head(job)   

#'  
#' ## Step 1: make a VCorpus object 
#'  
#'  
jobCorpus<-VCorpus(VectorSource(job[, 10])) 
#'  
#'  
#' ## Step 2: clean the VCorpus object  
#'  
#'  
jobCorpus<-tm_map(jobCorpus, tolower)   
for(j in seq(jobCorpus)){   
  jobCorpus[[j]] <- gsub("_", " ", jobCorpus[[j]])  
}   
#   
#   
jobCorpus<-tm_map(jobCorpus, removeWords, stopwords("english")) 
jobCorpus<-tm_map(jobCorpus, removePunctuation) 
jobCorpus<-tm_map(jobCorpus, stripWhitespace)   
jobCorpus<-tm_map(jobCorpus, PlainTextDocument) 
jobCorpus<-tm_map(jobCorpus, stemDocument)  
#
#   
# build document-term matrix    
#   
# Term Document Matrix (TDM) objects (`tm::DocumentTermMatrix`) contain a sparse term-document matrix or document-term matrix and attribute weights of the matrix.  
#   
# First make sure that we got a clean VCorpus object    
#   
jobCorpus[[1]]$content  
#   
#   
# Then we can start to build the DTM and reassign labels to the `Docs`. 

    
dtm<-DocumentTermMatrix(jobCorpus)  
dtm 
dtm$dimnames$Docs<-as.character(1:200)  
inspect(dtm[1:10, 1:10]) ###<-- error happens from here 

#' Let's subset the `dtm` into top 30 jobs and bottom 100 jobs. 
    
    
dtm_top30<-dtm[1:30, ]  
dtm_bot100<-dtm[101:200, ]

Upvotes: 0

Answers (2)

Pete

Reputation: 1

As an alternative to the answer offered by @phiver, convert the job descriptions to a list after "head(job)":

jobs <- as.list(job$Description)
jobCorpus <- VCorpus(VectorSource(jobs))

....
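
To illustrate why extracting the column itself matters, here is a small base R sketch. It assumes that job[, 10] in the question keeps its data frame structure (as a tibble would, or a data.frame with drop = FALSE); the toy table job_toy and its contents are hypothetical stand-ins for the scraped data.

```r
# Hypothetical toy table standing in for the scraped job data
job_toy <- data.frame(Index = 1:3,
                      Description = c("design software",
                                      "analyze data",
                                      "repair hardware"),
                      stringsAsFactors = FALSE)

# Keeping the data frame structure gives a list with ONE element (the column)
as_column <- job_toy[, "Description", drop = FALSE]
length(as_column)

# Extracting the column itself gives one element per row
as_vector <- job_toy$Description
length(as_vector)

# as.list() then yields one list element per description
jobs <- as.list(as_vector)
length(jobs)
```

If the first length is 1 while the others match the number of rows, a source built from the one-column table would see a single element rather than one per job.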

Upvotes: 0

phiver

Reputation: 23608

There are two issues. First, calling tolower directly in tm_map strips too much information from the corpus; wrap it in content_transformer() instead. Second, you should use DataframeSource instead of VectorSource. With VectorSource as you use it, you load only 1 document with 200 lines instead of 200 documents with one line each.
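
A minimal sketch of the first issue, using a hypothetical two-document corpus (the texts are made up for illustration). As far as I understand tm's behaviour, applying tolower directly replaces each document with its plain character content, while content_transformer(tolower) changes only the text:

```r
library(tm)

# Hypothetical two-document corpus for illustration
texts <- c("Research and DESIGN software", "Maintain HARDWARE systems")
corpus <- VCorpus(VectorSource(texts))

# Applying tolower directly turns each document into a bare character vector
raw_lower <- tm_map(corpus, tolower)
class(raw_lower[[1]])

# content_transformer() lowercases the text but keeps the document object
safe_lower <- tm_map(corpus, content_transformer(tolower))
class(safe_lower[[1]])
content(safe_lower[[1]])
```

Comparing the two class() results shows whether the documents are still text document objects after the transformation.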

The code below works. It starts from the point where you have created the job data.frame:

# DataframeSource needs the columns doc_id and text; you could also rename two columns in job.
# Instead of a doc_id of the form doc_#, you could also use the job title column.
job_for_corpus <- data.frame(doc_id = paste0("doc_", job$Index),
                             text = job$Description, stringsAsFactors = FALSE)

# no need for loop, just use gsub on data.frame column
job_for_corpus$text <- gsub("_", " ", job_for_corpus$text)

# create corpus
jobCorpus <- VCorpus(DataframeSource(job_for_corpus))

# clean text
jobCorpus <- tm_map(jobCorpus, content_transformer(tolower))   
jobCorpus <- tm_map(jobCorpus, removeWords, stopwords("english")) 
jobCorpus <- tm_map(jobCorpus, removePunctuation) 
jobCorpus <- tm_map(jobCorpus, stripWhitespace)   
jobCorpus <- tm_map(jobCorpus, stemDocument)  


jobCorpus[[1]]$content  
[1] "research design develop maintain softwar system along hardwar develop medic scientif industri purpos"

# create document term matrix
dtm <- DocumentTermMatrix(jobCorpus)  

inspect(dtm[1:10, 1:10]) 
<<DocumentTermMatrix (documents: 10, terms: 10)>>
Non-/sparse entries: 2/98
Sparsity           : 98%
Maximal term length: 7
Weighting          : term frequency (tf)
Sample             :
        Terms
Docs     16wheel abnorm access accid accord account accur achiev act activ
  doc_1        0      0      0     0      0       0     0      0   0     0
  doc_10       0      0      0     0      0       0     0      0   0     0
  doc_2        0      0      0     0      0       0     0      0   0     0
  doc_3        0      0      0     1      0       0     0      0   0     0
  doc_4        0      0      0     0      0       0     0      0   0     0
  doc_5        0      0      0     0      0       0     0      0   0     0
  doc_6        0      0      0     0      0       0     0      0   0     0
  doc_7        0      0      0     0      0       0     0      0   0     0
  doc_8        0      0      0     0      1       0     0      0   0     0
  doc_9        0      0      0     0      0       0     0      0   0     0

# rest of your code
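
A quick way to diagnose this kind of "subscript out of bounds" error is to check how many documents and terms the matrix really has before subsetting. The sketch below uses hypothetical descriptions rather than the scraped data, and the clamping of the index range is just one possible safeguard, not something tm requires:

```r
library(tm)

# Hypothetical stand-in for the scraped job descriptions
descriptions <- c("research design develop software",
                  "analyze data and build models",
                  "maintain hardware systems")

# A list holding the whole vector as one element yields ONE document
one_doc <- VCorpus(VectorSource(list(descriptions)))
# A plain character vector yields one document per element
per_row <- VCorpus(VectorSource(descriptions))

length(one_doc)  # number of documents in each corpus
length(per_row)

dtm <- DocumentTermMatrix(per_row)
dim(dtm)         # rows = documents, columns = terms

# Clamp the requested range to the real dimensions before subsetting
rows <- seq_len(min(10, nrow(dtm)))
cols <- seq_len(min(10, ncol(dtm)))
inspect(dtm[rows, cols])
```

If dim(dtm) reports a single row, indexing with 1:10 on the rows will fail exactly as in the question.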

Upvotes: 1
