Reputation: 51
I have a dataframe that has one word per row, but I want to convert it using R to sentences/paragraphs per row based on the value of a the columns spk and id. Here is a the dataframe sample :
id word spk
123 hi 1
123 how 1
123 are 1
123 you 1
123 good 2
123 thank 2
123 you 2
123 Anything 1
123 I 1
123 can 1
123 do 1
123 nothing 2
123 for 2
123 now 2
789 what 1
789 is 1
789 the 1
789 issue 1
789 there 2
789 is 2
789 no 2
789 issue 2
789 now 2
789 thank 1
789 you 1
789 for 1
789 contacting1
789 us 1
Desired output
id word spk
123 hi how are you 1
123 good thank you 2
123 Anything I can do 1
123 nothing for now 2
789 what is the issue 1
789 there is no issue now 2
789 thank you for contacting us 1
Any help would be appreciated. Thank you
Upvotes: 2
Views: 780
Reputation: 28675
require(data.table)
setDT(df)
df[, .(word = paste(word, collapse = ' ')
, id = unique(id)
, spk = unique(spk))
, by = .(phrase = rleid(spk))]
# phrase word id spk
# 1: 1 hi how are you 123 1
# 2: 2 good thank you 123 2
# 3: 3 Anything I can do 123 1
# 4: 4 nothing for now 123 2
# 5: 5 what is the issue 789 1
# 6: 6 there is no issue now 789 2
# 7: 7 thank you for contacting us 789 1
In tidyverse this is
df %>%
mutate(phrase = data.table::rleid(spk)) %>%
group_by(phrase) %>%
summarise(id = unique(id),
words = paste(word, collapse = " "))
Upvotes: 3
Reputation: 47300
Here's a base version:
df$i <- cumsum(c(FALSE,!!diff(df$spk)))
aggregate(word ~ id + spk + i,df,paste,collapse= " ")[-3]
# id spk word
# 1 123 1 hi how are you
# 2 123 2 good thank you
# 3 123 1 Anything I can do
# 4 123 2 nothing for now
# 5 789 1 what is the issue
# 6 789 2 there is no issue now
# 7 789 1 thank you for contacting us
Upvotes: 0
Reputation: 10291
Another dplyr
approach, assuming your word
column is a character vector:
library(dplyr)
df %>%
mutate(group = cumsum(ifelse(df$spk != lag(df$spk, default = 0), 1, 0))) %>%
group_by(id, group) %>%
mutate(sentence = paste(word, collapse = " ")) %>%
ungroup %>%
select(-word, -group) %>%
distinct()
# A tibble: 7 x 3
id spk sentence
<int> <int> <chr>
1 123 1 hi how are you
2 123 2 good thank you
3 123 1 Anything I can do
4 123 2 nothing for now
5 789 1 what is the issue
6 789 2 there is no issue now
7 789 1 thank you for contacting us
Upvotes: 0
Reputation: 20463
Here's a tidyverse
approach that also borrows rleid
from the data.table
package:
library(tidyverse)
df <-
structure(list(id = c(123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 789L, 789L, 789L, 789L, 789L, 789L, 789L, 789L, 789L, 789L, 789L, 789L, 789L, 789L),
word = structure(c(8L, 9L, 2L, 21L, 7L, 16L, 21L, 1L, 10L, 3L, 5L, 14L, 6L, 15L, 20L, 11L, 17L, 12L, 18L, 11L, 13L, 12L, 15L, 16L, 21L, 6L, 4L, 19L),
.Label = c("Anything", "are", "can", "contacting", "do", "for", "good", "hi", "how", "I", "is", "issue", "no", "nothing", "now", "thank", "the", "there", "us", "what", "you"), class = "factor"),
spk = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L)), class = "data.frame", row.names = c(NA, -28L))
df %>%
mutate(spk_num = data.table::rleid(spk)) %>%
mutate(word = as.character(word)) %>%
group_by(id, spk, spk_num) %>%
nest() %>%
mutate(words = map(data, unlist),
words = map_chr(words, paste, collapse = " "))
# A tibble: 7 x 5
id spk spk_num data words
<int> <int> <int> <list> <chr>
1 123 1 1 <tibble [4 x 1]> hi how are you
2 123 2 2 <tibble [3 x 1]> good thank you
3 123 1 3 <tibble [4 x 1]> Anything I can do
4 123 2 4 <tibble [3 x 1]> nothing for now
5 789 1 5 <tibble [4 x 1]> what is the issue
6 789 2 6 <tibble [5 x 1]> there is no issue now
7 789 1 7 <tibble [5 x 1]> thank you for contacting us
Overview: First, make a spk_num
column since your desired output wishes to differentiate between speakers within the same id. Next, group_by
everything and nest
the data. Finally we'll map paste
with collapse across the whole output, but you'll first need to unlist the word data.
Upvotes: 1
Reputation: 639
You can do this using data.table with the function paste() by grouping your rows appropriately. The only problem is that you have no reference of which phrase of the conversation you're on, so first you have to create a new column that specifies this.
library(data.table)
#This function takes a vector of speakers, say, (1,1,1,2,2,1,1,2,2,2) and gives you which phrase of the conversation you're on (1,1,1,2,2,3,3,4,4,4)
class_phrase = function(spk){
phrase = 1
phrase_n = 1
for(i in 2:length(spk)){
if(spk[i] == spk[i-1])
{phrase[i] = phrase[i-1]} else
{phrase_n = phrase_n + 1
phrase[i] = phrase_n}
}
return(phrase)
}
#Now you use this function with your data.table, grouping by id:
dt[, phrase := class_phrase(spk), by = id]
#Having a phrase number, you can just paste everything together by grouping your original data.table by id and phrase
dt[, .(word = paste(word, collapse = ' '), spk = unique(spk)), by = .(id, phrase)]
Upvotes: 0