K. Am
K. Am

Reputation: 51

Concatenate multiples words each in separate row into one row using R

I have a dataframe that has one word per row, but I want to convert it using R to sentences/paragraphs per row based on the value of a the columns spk and id. Here is a the dataframe sample :

id  word    spk
123 hi       1
123 how      1
123 are      1
123 you      1
123 good     2
123 thank    2
123 you      2
123 Anything 1
123 I        1
123 can      1
123 do       1
123 nothing  2
123 for      2
123 now      2
789 what     1
789 is       1
789 the      1
789 issue    1
789 there    2
789 is       2
789 no       2
789 issue    2
789 now      2
789 thank    1
789 you      1
789 for      1
789 contacting1
789 us       1

Desired output

id  word                           spk
123 hi how are you                  1
123 good thank you                  2
123 Anything I can do               1
123 nothing for now                 2
789 what is the issue               1
789 there is no issue now           2
789 thank you for contacting us     1

Any help would be appreciated. Thank you

Upvotes: 2

Views: 780

Answers (5)

IceCreamToucan
IceCreamToucan

Reputation: 28675

require(data.table)
setDT(df)

df[, .(word = paste(word, collapse = ' ')
     , id   = unique(id)
     , spk  = unique(spk))
   , by = .(phrase = rleid(spk))]

#    phrase                        word  id spk
# 1:      1              hi how are you 123   1
# 2:      2              good thank you 123   2
# 3:      3           Anything I can do 123   1
# 4:      4             nothing for now 123   2
# 5:      5           what is the issue 789   1
# 6:      6       there is no issue now 789   2
# 7:      7 thank you for contacting us 789   1

In tidyverse this is

df %>%
  mutate(phrase = data.table::rleid(spk)) %>%
  group_by(phrase) %>%
  summarise(id = unique(id),
            words = paste(word, collapse = " "))

Upvotes: 3

moodymudskipper
moodymudskipper

Reputation: 47300

Here's a base version:

df$i <-  cumsum(c(FALSE,!!diff(df$spk)))
aggregate(word ~ id + spk + i,df,paste,collapse= " ")[-3]
#    id spk                        word
# 1 123   1              hi how are you
# 2 123   2              good thank you
# 3 123   1           Anything I can do
# 4 123   2             nothing for now
# 5 789   1           what is the issue
# 6 789   2       there is no issue now
# 7 789   1 thank you for contacting us

Upvotes: 0

Luke C
Luke C

Reputation: 10291

Another dplyr approach, assuming your word column is a character vector:

library(dplyr)

df %>%
  mutate(group = cumsum(ifelse(df$spk != lag(df$spk, default = 0), 1, 0))) %>%
  group_by(id, group) %>%
  mutate(sentence = paste(word, collapse = " ")) %>%
  ungroup %>%
  select(-word, -group) %>%
  distinct() 

# A tibble: 7 x 3
     id   spk sentence                   
  <int> <int> <chr>                      
1   123     1 hi how are you             
2   123     2 good thank you             
3   123     1 Anything I can do          
4   123     2 nothing for now            
5   789     1 what is the issue          
6   789     2 there is no issue now      
7   789     1 thank you for contacting us

Upvotes: 0

JasonAizkalns
JasonAizkalns

Reputation: 20463

Here's a tidyverse approach that also borrows rleid from the data.table package:

library(tidyverse)

df <- 
structure(list(id = c(123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 789L, 789L, 789L, 789L, 789L, 789L, 789L, 789L, 789L, 789L, 789L, 789L, 789L, 789L), 
               word = structure(c(8L, 9L, 2L, 21L, 7L, 16L, 21L, 1L, 10L, 3L, 5L, 14L, 6L, 15L, 20L, 11L, 17L, 12L, 18L, 11L, 13L, 12L, 15L, 16L, 21L, 6L, 4L, 19L), 
              .Label = c("Anything", "are", "can", "contacting", "do", "for", "good", "hi", "how", "I", "is", "issue", "no", "nothing", "now", "thank", "the", "there", "us", "what", "you"), class = "factor"), 
              spk = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L)), class = "data.frame", row.names = c(NA, -28L))

df %>%
  mutate(spk_num = data.table::rleid(spk)) %>%
  mutate(word = as.character(word)) %>%
  group_by(id, spk, spk_num) %>%
  nest() %>%
  mutate(words = map(data, unlist),
         words = map_chr(words, paste, collapse = " "))

# A tibble: 7 x 5
     id   spk spk_num data             words                      
  <int> <int>   <int> <list>           <chr>                      
1   123     1       1 <tibble [4 x 1]> hi how are you             
2   123     2       2 <tibble [3 x 1]> good thank you             
3   123     1       3 <tibble [4 x 1]> Anything I can do          
4   123     2       4 <tibble [3 x 1]> nothing for now            
5   789     1       5 <tibble [4 x 1]> what is the issue          
6   789     2       6 <tibble [5 x 1]> there is no issue now      
7   789     1       7 <tibble [5 x 1]> thank you for contacting us

Overview: First, make a spk_num column since your desired output wishes to differentiate between speakers within the same id. Next, group_by everything and nest the data. Finally we'll map paste with collapse across the whole output, but you'll first need to unlist the word data.

Upvotes: 1

Luis
Luis

Reputation: 639

You can do this using data.table with the function paste() by grouping your rows appropriately. The only problem is that you have no reference of which phrase of the conversation you're on, so first you have to create a new column that specifies this.

library(data.table)

#This function takes a vector of speakers, say, (1,1,1,2,2,1,1,2,2,2) and gives you which phrase of the conversation you're on (1,1,1,2,2,3,3,4,4,4)

class_phrase = function(spk){
  phrase = 1
  phrase_n = 1
  for(i in 2:length(spk)){
    if(spk[i] == spk[i-1])
      {phrase[i] = phrase[i-1]} else
      {phrase_n = phrase_n + 1
       phrase[i] = phrase_n}
  }
  return(phrase)
}

#Now you use this function with your data.table, grouping by id:

dt[, phrase := class_phrase(spk), by = id]

#Having a phrase number, you can just paste everything together by grouping your original data.table by id and phrase

dt[, .(word = paste(word, collapse = ' '), spk = unique(spk)), by = .(id, phrase)]

Upvotes: 0

Related Questions