indra_patil

Reputation: 283

Grouping similar texts in R

I have employee designation data with a large number of unique values. I want to group the many variants of a single designation together (for example 'Senior Manager', 'Sr. Manager', 'Sen manager', 'Snr Manager'). The data also contains typos.

What is the best technique for merging these variants into one designation using R?

Is clustering the best way to solve this, or would some other technique work better?

I tried Euclidean distance (with hierarchical clustering) and k-means, but neither gave satisfactory results.

    library(tm)
    library(data.table)
    library(SparseM)
    library(cluster)
    data <- readLines('RDATA.txt')
    head(data)


    data <- data[1:50]
    # Build a corpus and normalise the text
    source <- VectorSource(data)
    corpus <- Corpus(source)
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removeNumbers)
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, stripWhitespace)
    corpus <- tm_map(corpus, removeWords, stopwords('english'))
    # Keep only terms of 4 to 15 characters (this drops short forms such as 'sr' and 'snr')
    dtm <- DocumentTermMatrix(corpus,
                              control = list(wordLengths = c(4, 15)))
    m <- as.matrix(dtm)
    # Hierarchical clustering on Euclidean distances between term-count rows
    distMatrix <- dist(m, method = "euclidean")
    groups <- hclust(distMatrix, method = "ward.D")
    groups2 <- cutree(groups, k = 10)
    clus_data <- cbind(data, groups2)
    clus_data

Upvotes: 1

Views: 710

Answers (1)

Has QUIT--Anony-Mousse

Reputation: 77454

This approach cannot work.

Consider "dog" and "fog". The two strings are similar, but you would not want them clustered together: the difference is not just a typo.
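To make the point concrete, here is a small sketch using base R's `adist()` (generalized Levenshtein distance from the `utils` package). The choice of edit distance as the similarity measure is an assumption made for illustration; the same issue shows up with other purely string-based measures.

```r
# Edit distance between two unrelated words
d_unrelated <- adist("dog", "fog")[1, 1]

# Edit distance between two spellings of the same designation
d_related <- adist("Sr. Manager", "Snr Manager")[1, 1]

d_unrelated   # 1: a single substitution
d_related     # 2: e.g. insert "n", delete "."
```

By raw string distance the unrelated pair is closer than the pair that should be merged, so distance alone cannot tell a typo from a different word.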

Because of this, you cannot use an unsupervised method like clustering. You need something trained on language, typical spelling mistakes, and maybe phonetics.
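One modest way to bring in that kind of knowledge, short of a trained model, is a hand-built abbreviation table combined with a strict fuzzy match against a list of known designations. The sketch below is only an illustration under stated assumptions: the `abbrev` table, the `canonical` list, the helper names `normalise` and `match_designation`, and the `max_dist` threshold are all hypothetical and would have to be built from the real data.

```r
# Hypothetical abbreviation table: known short forms -> canonical token
abbrev <- c(sr = "senior", snr = "senior", sen = "senior",
            jr = "junior", mgr = "manager")

# Hypothetical list of canonical designations to map onto
canonical <- c("senior manager", "junior manager", "software engineer")

# Lower-case, strip punctuation, and expand known abbreviations token by token
normalise <- function(x) {
  x <- gsub("[[:punct:]]", " ", tolower(x))
  tokens <- strsplit(trimws(x), "\\s+")
  vapply(tokens, function(tk) {
    hit <- tk %in% names(abbrev)
    tk[hit] <- abbrev[tk[hit]]
    paste(tk, collapse = " ")
  }, character(1))
}

# Map each input to its nearest canonical designation by edit distance;
# return NA when nothing is within max_dist edits (an assumed threshold)
match_designation <- function(x, max_dist = 2) {
  norm <- normalise(x)
  d <- adist(norm, canonical)            # rows: inputs, columns: canonical
  best <- apply(d, 1, which.min)
  out <- canonical[best]
  out[d[cbind(seq_along(norm), best)] > max_dist] <- NA
  out
}

raw <- c("Senior Manager", "Sr. Manager", "Sen manager",
         "Snr Manager", "Senior Manger", "Data Scientist")
data.frame(raw, matched = match_designation(raw))
```

Unmatched values come back as `NA` and can be reviewed by hand, which is also a natural way to grow the abbreviation table and the canonical list over time.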

Upvotes: 1
