french_fries

Reputation: 1

Group values in column based on common words

I have a dataframe:

ID    message
1     request body: <?xml version="2.0",<code> dwfkjn34241
2     request body: <?xml version="2.0",<code> jnwg3425
3     request body: <?xml version="2.0", <PlatCode>, <code> qwefn2
4     received an error
5     <MarkCheckMSG>
6     received an error

I want to group the values in the message column based on common words. The first three rows can be considered one group, even though they differ slightly; the fourth and sixth rows form another group. How can I group the values in the message column using word-level and structural similarity as the criterion? What is a good method for this? The dataframe above is only an example, so I am more interested in general methods that fit the problem than in, say, a solution based on regular expressions.

Upvotes: 1

Views: 243

Answers (1)

ekoam

Reputation: 8844

Perhaps try a k-medoids clustering analysis with a string distance measure?

library(cluster)
library(stringdist)

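# Run PAM (k-medoids) for every k from k_from (at least 2) upwards and
# keep the solution with the highest average silhouette width.
find_medoids <- function(x, k_from, method = "osa", weight = c(d = 1, i = 1, s = 1, t = 1)) {
  # Pairwise string distances, labelled by the strings themselves
  diss <- stringdist::stringdistmatrix(x, x, method = method, weight = weight)
  dimnames(diss) <- list(x, x)
  # pam() requires k < n, so cap k at n - 1 when every string is unique
  k_max <- min(length(unique(x)), length(x) - 1L)
  trials <- lapply(
    seq(from = k_from, to = k_max),
    function(i) cluster::pam(diss, i, diss = TRUE)
  )
  sel <- which.max(vapply(trials, `[[`, numeric(1L), c("silinfo", "avg.width")))
  trials[[sel]]
}

# Look up the cluster id of each string by name
map_cluster <- function(x, med_obj) {
  unname(med_obj$clustering[x])
}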

Output

> map_cluster(df$message, find_medoids(df$message, 2, "cosine"))
[1] 1 1 1 2 3 2

For your real data, you may have to adjust some parameters such as the string distance method (the example above used cosine distance).
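If you want the grouping to depend on shared words rather than on characters, one option is to swap in a word-level distance. The following is only a minimal base-R sketch, not part of the approach above: it uses Jaccard distance on word sets with hierarchical clustering instead of k-medoids. The helper names (`tokenize`, `jaccard_dist`, `word_dist_matrix`) and the cut height of 0.7 are my own assumptions for illustration, and the data frame is rebuilt from the question's example.

```r
# Example data copied from the question
df <- data.frame(
  ID = 1:6,
  message = c(
    'request body: <?xml version="2.0",<code> dwfkjn34241',
    'request body: <?xml version="2.0",<code> jnwg3425',
    'request body: <?xml version="2.0", <PlatCode>, <code> qwefn2',
    'received an error',
    '<MarkCheckMSG>',
    'received an error'
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split a message into a set of lower-case
# alphanumeric tokens, dropping empty strings
tokenize <- function(s) {
  tokens <- strsplit(tolower(s), "[^a-z0-9]+")[[1]]
  unique(tokens[nzchar(tokens)])
}

# Hypothetical helper: Jaccard distance between two token sets,
# i.e. 1 - (size of intersection) / (size of union)
jaccard_dist <- function(a, b) {
  u <- length(union(a, b))
  if (u == 0) return(0)
  1 - length(intersect(a, b)) / u
}

# Hypothetical helper: full pairwise distance matrix over all messages
word_dist_matrix <- function(x) {
  toks <- lapply(x, tokenize)
  n <- length(toks)
  d <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      d[i, j] <- jaccard_dist(toks[[i]], toks[[j]])
    }
  }
  d
}

d <- word_dist_matrix(df$message)

# Average-linkage hierarchical clustering; the cut height 0.7 is an
# assumed threshold that would need tuning on real data
hc <- hclust(as.dist(d), method = "average")
df$group <- cutree(hc, h = 0.7)
```

On the example data this puts rows 1 to 3 in one group, rows 4 and 6 in another, and row 5 on its own. The distance matrix `d` could also be passed to `cluster::pam(as.dist(d), k)` if you prefer to stay with k-medoids.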

Upvotes: 2
