Reputation: 283
I have employee designation data with a very large number of unique values. I want to group the many variants of a single designation together, for example 'Senior Manager', 'Sr. Manager', 'Sen manager', 'Snr Manager' and so on. The data also contains typos.
What is the best technique for merging these variants into one designation using R?
Is clustering the best way to solve this, or would some other technique work better?
I tried clustering on Euclidean distance and k-means, but neither gave satisfactory results.
library(tm)
library(data.table)
library(SparseM)
library(cluster)

# Read one designation per line and keep the first 50 for testing
data <- readLines('RDATA.txt')
head(data)
data <- data[1:50]

# Build a corpus and normalise the text
source <- VectorSource(data)
corpus <- Corpus(source)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords('english'))

# Document-term matrix. Note: wordLengths = c(4, 15) keeps only terms of
# 4 to 15 characters, so short abbreviations such as "sr" or "snr" are dropped.
dtm <- DocumentTermMatrix(corpus,
                          control = list(wordLengths = c(4, 15)))
m <- as.matrix(dtm)

# Hierarchical clustering on Euclidean distances, cut into 10 groups
distMatrix <- dist(m, method = "euclidean")
groups <- hclust(distMatrix, method = "ward.D")
groups2 <- cutree(groups, k = 10)
clus_data <- cbind(data, groups2)
clus_data
Upvotes: 1
Views: 710
Reputation: 77454
Clustering cannot work for this.
Consider "dog" and "fog". The two words differ by a single letter, but you do not want them in the same cluster: the difference is not a typo, they are simply different words.
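To make the point concrete, here is a minimal sketch using the Levenshtein (edit) distance from base R's `adist`. The example strings are made up for illustration. The two unrelated words are closer to each other than two spellings of the same designation are.

```r
# Pairwise Levenshtein (edit) distances with base R's utils::adist.
# The example strings are illustrative, not taken from the asker's data.
words <- c("dog", "fog", "sr manager", "senior manager")
d <- adist(words)
dimnames(d) <- list(words, words)

d["dog", "fog"]                    # 1: a single substitution
d["sr manager", "senior manager"]  # 4: four inserted letters
```

Any method that groups strings purely by this kind of distance would merge "dog" and "fog" before it merged the two manager titles.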
Because surface similarity alone cannot tell a typo from a different word, you cannot use an unsupervised method like clustering. You need something trained on language, on typical spelling mistakes, and maybe on phonetics.
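One simple way to bring in that kind of knowledge is a hand-built abbreviation table applied before any grouping. The sketch below is only an illustration under assumptions: the `abbrev` table and the `normalize_title` helper are hypothetical names, and the entries would have to come from inspecting your own data.

```r
# Hypothetical lookup table: abbreviation -> canonical token.
# The entries are assumptions and must be built from your own data.
abbrev <- c(sr = "senior", snr = "senior", sen = "senior", mgr = "manager")

# Hypothetical helper: lower-case, strip punctuation, expand known abbreviations.
normalize_title <- function(x, lookup = abbrev) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", " ", x)            # "sr." becomes "sr "
  tokens <- strsplit(trimws(x), "\\s+")
  vapply(tokens, function(tok) {
    hit <- tok %in% names(lookup)
    tok[hit] <- lookup[tok[hit]]              # replace known abbreviations
    paste(tok, collapse = " ")
  }, character(1))
}

titles <- c("Senior Manager", "Sr. Manager", "Sen manager", "Snr Manager")
unique(normalize_title(titles))
```

All four variants collapse to "senior manager". This handles only known abbreviations; genuine typos would still need an extra step, for example matching each token against a vetted vocabulary with a small edit-distance threshold or a phonetic encoding.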
Upvotes: 1