vk087

Reputation: 106

Word frequency counter in R using n-grams

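I would like to transform the input data below into the output format shown: for every n-gram (up to four words) in colA, count the rows that contain it and concatenate the corresponding colB values.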

Input

colA                             colB
textA textB textC textD           m
textA textB                       n
textB textC                       p
textB textC textD                 q

Output

type    col_a              col_b(frequency)           col_c
unigram textA                        2                  m+n
unigram textB                        4                m+n+p+q
unigram textC                        3                 m+p+q
unigram textD                        2                  m+q
bigram  textA textB                  2                  m+n
bigram  textB textC                  3                 m+p+q
bigram  textC textD                  2                  m+q
trigram textA textB textC            1                   m
trigram textB textC textD            2                   m+q
fourgram textA textB textC textD     1                   m

I need to do this separately for unigrams, bigrams, trigrams and fourgrams, and then rbind the results.

Upvotes: 1

Views: 209

Answers (1)

Sotos

Reputation: 51592

Here is an idea

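n_grams <- function(n) {
  # unique words across all rows of colA (colA must be character, not factor)
  unigrams1 <- unique(unlist(strsplit(df$colA, ' ')))
  # candidate n-grams: every n-word combination, kept in order of first appearance
  candidates <- apply(combn(unigrams1, n), 2, paste, collapse = ' ')
  # for each candidate, join the colB labels of the rows that contain it
  t1 <- sapply(candidates, function(i) paste(df$colB[grepl(i, df$colA, fixed = TRUE)], collapse = '+'))
  # drop candidates that occur in no row
  return(t1[nchar(t1) > 0])
}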
#testing the function

n_grams(1)
#    textA     textB     textC     textD 
#    "m+n" "m+n+p+q"   "m+p+q"     "m+q" 
n_grams(2)
#textA textB textB textC textC textD 
#      "m+n"     "m+p+q"       "m+q" 
n_grams(3)
#textA textB textC textB textC textD 
#              "m"             "m+q" 
n_grams(4)
#textA textB textC textD 
#                    "m" 
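The function above builds candidates with combn over the whole vocabulary, which grows quickly and only produces word orders that follow first appearance. As a rough alternative, here is a minimal sketch that slides a window over each row to get the contiguous n-grams directly. The helper names `row_ngrams` and `n_grams_direct`, and passing `df` as an argument, are assumptions for illustration and not part of the answer above.

```r
# Example data from the question (colA and colB kept as character)
df <- data.frame(
  colA = c("textA textB textC textD", "textA textB", "textB textC", "textB textC textD"),
  colB = c("m", "n", "p", "q"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: distinct contiguous n-grams of one space-separated string
row_ngrams <- function(x, n) {
  words <- strsplit(x, " ", fixed = TRUE)[[1]]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  grams <- vapply(starts,
                  function(i) paste(words[i:(i + n - 1)], collapse = " "),
                  character(1))
  unique(grams)  # count each n-gram at most once per row
}

# Hypothetical helper: for each n-gram, join the colB labels of the rows containing it
n_grams_direct <- function(df, n) {
  grams <- lapply(df$colA, row_ngrams, n = n)
  all_grams <- unlist(grams)
  labels <- rep(df$colB, lengths(grams))  # one label per (row, n-gram) pair
  groups <- split(labels, factor(all_grams, levels = unique(all_grams)))
  vapply(groups, paste, character(1), collapse = "+")
}

n_grams_direct(df, 2)
```

On the example data this should give the same named vectors as n_grams(1) to n_grams(4), so stack() and rbind() can be applied to it in the same way.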

To construct your desired output:

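df1 <- data.frame(rbind(stack(n_grams(1)), stack(n_grams(2)), stack(n_grams(3)), stack(n_grams(4))))
# frequency = number of '+'-separated labels (also correct for labels longer than one character)
df1$freq <- lengths(strsplit(as.character(df1$values), '+', fixed = TRUE))
df1 <- df1[,c('ind', 'freq', 'values')]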
df1
#                       ind freq  values
#1                    textA    2     m+n
#2                    textB    4 m+n+p+q
#3                    textC    3   m+p+q
#4                    textD    2     m+q
#5              textA textB    2     m+n
#6              textB textC    3   m+p+q
#7              textC textD    2     m+q
#8        textA textB textC    1       m
#9        textB textC textD    2     m+q
#10 textA textB textC textD    1       m
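The result above does not include the type column from the requested output. One possible way to derive it, sketched here on a standalone vector `ind` that stands in for df1$ind (an assumption made so the snippet runs on its own), is to count the words in each n-gram and use that count as an index into a vector of labels.

```r
# Stand-in for the `ind` column of df1 (hypothetical sample values)
ind <- c("textA", "textA textB", "textB textC textD", "textA textB textC textD")

# Number of space-separated words in each n-gram
n_words <- lengths(strsplit(ind, " ", fixed = TRUE))

# Map the word count (1 to 4) to the label used in the question
type <- c("unigram", "bigram", "trigram", "fourgram")[n_words]
type
```

Since stack() returns ind as a factor, the same idea applied to the data frame would use as.character(df1$ind) in place of ind before assigning the result to df1$type.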

Upvotes: 3
