amunategui

Reputation: 1220

Transform column of strings with word cluster values

I am doing some basic NLP work in R. I have two data sets and want to replace the words in one with the cluster value of each word from the other.

The first data set holds sentences and the second holds the cluster value for each word (assume that every word in the first data set has a cluster value):

original_text_df <- read.table(text="Text
'this is some text'
'this is more text'", header=T, sep="") 

cluster_df <- read.table(text="Word Cluster
this 2
is 2 
some 3
text 4
more 3", header=T, sep="") 

This is the desired transformed output:

Text
"2 2 3 4"
"2 2 3 4"

Looking for an efficient solution as I have long sentences and many of them. Thanks!

Upvotes: 1

Views: 353

Answers (1)

Steven Beaupré

Reputation: 21621

You could try something like this:

library(tidyr)
library(dplyr)
library(stringi)

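# Split each sentence into words and stack them into one long table,
# keeping track of which sentence (group) each word came from
df1 <- unnest(stri_split_fixed(original_text_df$Text, ' '), group) %>%
  # For each distinct word x, pick its cluster value from cluster_df
  group_by(x) %>% mutate(cluster = cluster_df$Cluster[cluster_df$Word %in% x]) 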

Which gives:

#Source: local data frame [8 x 3]
#Groups: x
#
#  group    x cluster
#1    X1 this       2
#2    X1   is       2
#3    X1 some       3
#4    X1 text       4 
#5    X2 this       2
#6    X2   is       2
#7    X2 more       3
#8    X2 text       4
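
The same long table can also be built with a join, which may be useful if unnest() on a bare list is not supported by your tidyr version. This is only a sketch under a few assumptions: the data frames are rebuilt inline so the block is self-contained, and the names long_df and the integer group index are mine, not part of the code above.

```r
library(dplyr)
library(tidyr)

# Same data as in the question, built inline (strings kept as character)
original_text_df <- data.frame(
  Text = c("this is some text", "this is more text"),
  stringsAsFactors = FALSE
)
cluster_df <- data.frame(
  Word = c("this", "is", "some", "text", "more"),
  Cluster = c(2, 2, 3, 4, 3),
  stringsAsFactors = FALSE
)

long_df <- original_text_df %>%
  mutate(group = row_number()) %>%             # one id per sentence (assumed name)
  separate_rows(Text, sep = " ") %>%           # one row per word, order preserved
  rename(x = Text) %>%
  left_join(cluster_df, by = c("x" = "Word"))  # attach each word's cluster value
```

A left join keeps every word from the sentences, so a word missing from cluster_df would show up with an NA cluster instead of being silently dropped.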

From there, to match your expected output, you could build a list of clusters for each group (sentence) using split() and reconstruct a data frame:

l <- split(df1$cluster, f = df1$group)
df2 <- data.frame(Text = do.call(rbind, lapply(l, paste0, collapse = " ")))

And you will get:

#      Text
#X1 2 2 3 4
#X2 2 2 3 4
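
If you would rather avoid the intermediate long table, a base R sketch with a named lookup vector might also do the job. It assumes sentences are separated by single spaces; lookup, words and result_df are illustrative names, and the data frames are rebuilt inline so the block runs on its own.

```r
# Same data as in the question, built inline (strings kept as character)
original_text_df <- data.frame(
  Text = c("this is some text", "this is more text"),
  stringsAsFactors = FALSE
)
cluster_df <- data.frame(
  Word = c("this", "is", "some", "text", "more"),
  Cluster = c(2, 2, 3, 4, 3),
  stringsAsFactors = FALSE
)

# Named vector: names are the words, values are their cluster ids
lookup <- setNames(cluster_df$Cluster, cluster_df$Word)

# Split every sentence into its words (fixed = TRUE avoids regex overhead)
words <- strsplit(as.character(original_text_df$Text), " ", fixed = TRUE)

# Index the lookup vector by word, then paste the clusters back together
result_df <- data.frame(
  Text = vapply(words, function(w) paste(lookup[w], collapse = " "), character(1)),
  stringsAsFactors = FALSE
)

result_df$Text
# [1] "2 2 3 4" "2 2 3 4"
```

Indexing a named vector is vectorized, so this should scale reasonably to long sentences. A word with no entry in cluster_df would come out as the string "NA".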

See also this very similar question I asked a few months ago, which drew many other suggestions.

Upvotes: 1
