Reputation: 1471
I am new to Python. I have created a term-document matrix in R, and I want to learn how to create the same thing in Python.
I am reading text data from the Description column of the data frame Res_Desc_Train, but I am not sure how to build a document-term matrix in Python. It would be helpful if there is any documentation I can learn from.
Below is the code I used in R.
library(tm)  # also attaches NLP, which provides ngrams() and words() for the bigram tokenizer

docs <- Corpus(VectorSource(Res_Desc_Train$Description))
docs <- tm_map(docs, content_transformer(tolower))
#remove potentially problematic symbols
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
removeSpecialChars <- function(x) gsub("[^a-zA-Z0-9 ]", "", x)
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "-")
docs <- tm_map(docs, toSpace, ":")
docs <- tm_map(docs, toSpace, ";")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\(")
docs <- tm_map(docs, toSpace, "\\)")
docs <- tm_map(docs, toSpace, ",")
docs <- tm_map(docs, toSpace, "_")
docs <- tm_map(docs, content_transformer(removeSpecialChars))
docs <- tm_map(docs, removeWords, stopwords("en"))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, removeNumbers)
#inspect(docs[440])
dataframe <- data.frame(text = unlist(sapply(docs, `[`, "content")), stringsAsFactors = FALSE)
BigramTokenizer <- function(x)
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
dtm <- DocumentTermMatrix(docs, control = list(stopwords = FALSE, wordLengths = c(2, Inf), tokenize = BigramTokenizer))
Weighteddtm <- weightTfIdf(dtm, normalize = TRUE)
mat.df <- as.data.frame(data.matrix(Weighteddtm), stringsAsFactors = FALSE)
mat.df <- cbind(mat.df, Res_Desc_Train$Group)
colnames(mat.df)[ncol(mat.df)] <- "Group"
Assignment.Distribution <- table(mat.df$Group)
Res_Desc_Train_Assign <- mat.df$Group
### Features have different ranges; normalize to bring them into the 0 to 1 range
### (another option would be to standardize with z-scores)
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
#normalize(c(1,2,3,4,5))
num_col <- ncol(mat.df) - 1
mat.df_normalize <- as.data.frame(lapply(mat.df[, 1:num_col], normalize))
mat.df_normalize <- cbind(mat.df_normalize, Res_Desc_Train_Assign)
colnames(mat.df_normalize)[ncol(mat.df_normalize)] <- "Group"
Upvotes: 1
Views: 1397
Reputation: 183
Normally, when you need to deal with text in Python, the best tool is NLTK. In your specific case there is a dedicated Python package that creates the term-document matrix; this package is called textmining.
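A minimal sketch of textmining's documented usage follows; the two example documents are made up, and note that the original package targets Python 2, so on Python 3 you may need a maintained fork:

# pip install textmining
import textmining

# Build the term-document matrix one document at a time.
tdm = textmining.TermDocumentMatrix()
tdm.add_doc('server crashed after the update')
tdm.add_doc('update failed on the database server')

# rows() yields the header (the vocabulary) first, then one row of
# term counts per document; cutoff=1 keeps terms seen in >= 1 document.
for row in tdm.rows(cutoff=1):
    print(row)

# Or write the matrix straight to a CSV file.
tdm.write_csv('term_document_matrix.csv', cutoff=1)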
Moreover, if you need regular expressions you can use Python's built-in re module. Otherwise, you can use a tokenizer directly from NLTK.
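Those two cover the cleaning steps from the R code above (the toSpace/removeSpecialChars transformations, removeNumbers, and the stopwords("en") removal). Below is a sketch under a few assumptions: Res_Desc_Train is a pandas DataFrame with a Description column as in the question, and scikit-learn's TfidfVectorizer (not mentioned above, but a common choice) stands in for DocumentTermMatrix plus weightTfIdf with the bigram tokenizer:

import re

import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('stopwords')  # only needed once
stop_words = set(stopwords.words('english'))

def clean(text):
    # Mirror the R preprocessing: lowercase, replace special characters
    # and digits with spaces, then drop stopwords.
    text = text.lower()
    text = re.sub(r'[^a-z0-9 ]', ' ', text)  # removeSpecialChars / toSpace
    text = re.sub(r'[0-9]+', ' ', text)      # removeNumbers
    return ' '.join(t for t in text.split() if t not in stop_words)

docs = [clean(d) for d in Res_Desc_Train['Description']]

# Bigram TF-IDF document-term matrix, analogous to
# DocumentTermMatrix(..., tokenize = BigramTokenizer) followed by weightTfIdf.
vectorizer = TfidfVectorizer(ngram_range=(2, 2), norm='l2')
dtm = vectorizer.fit_transform(docs)  # sparse documents-by-terms matrix

print(dtm.shape)
print(vectorizer.get_feature_names_out()[:10])  # first few bigram features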
Upvotes: 1