How to join tokenized words back together in a column of an R data frame

Question

I have a data frame of previously tokenized words that looks like the one below. Replication code:

df <- data.frame(id   = c("1", "2", "3"),
                 text = c("['I', 'like', 'apple']", "['we', 'go', 'swimming']", "['ask', 'questions']"))

id   text
1   ["I", "like", "apple"]
2   ["we", "go", "swimming"]
3   ["ask", "questions"]

The original data frame was obtained in Python after preprocessing (including tokenizing) raw text data.
I'd like to merge these tokens back into sentences, so that the result looks like this:

id   text
1   I like apple
2   we go swimming
3   ask questions

I tried the paste() function, df$text_new <- paste(df$text, sep = " "), but it did not work: it returned the same strings unchanged.

lhs · Accepted Answer

You can separate() the tokens into columns and then unite() them again with tidyr. The into = argument needs a character vector with at least as many column names as the longest sentence has words; I used letters to get 26 names, and then referred to the first and last of those columns (a:z) in unite().

library(tidyr)

df <- data.frame(id   = c("1", "2", "3"),
                 text = c("['I', 'like', 'apple']", "['we', 'go', 'swimming']", "['ask', 'questions']"))

df %>% 
  separate(text, into = letters, fill = "right") %>% 
  unite(text, a:z, sep = " ", na.rm = TRUE)

#>   id            text
#> 1  1    I like apple 
#> 2  2  we go swimming 
#> 3  3   ask questions

Created on 2022-05-26 by the reprex package (v2.0.1)
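
As a possible alternative, here is a minimal base R sketch. It assumes each cell is a single string holding a Python-style list, and that no token itself contains an apostrophe, a comma, or a square bracket; the column name text_new is taken from the question, and cleaned is just an illustrative intermediate. That single-string layout is also the likely reason paste(df$text, sep = " ") changes nothing: sep only separates multiple arguments, and here there is one argument whose elements are already whole strings.

```r
df <- data.frame(id   = c("1", "2", "3"),
                 text = c("['I', 'like', 'apple']", "['we', 'go', 'swimming']", "['ask', 'questions']"))

# Assumption: tokens contain no apostrophes, commas, or square brackets.
# Step 1: drop the square brackets and the single quotes around each token.
cleaned <- gsub("\\[|\\]|'", "", df$text)

# Step 2: replace each comma (and any spaces after it) with a single space.
df$text_new <- gsub(", *", " ", cleaned)

print(df[, c("id", "text_new")])
```

Because this removes the list punctuation directly instead of splitting on it, the result has no leading or trailing spaces. The separate()/unite() output shown above appears to keep one on each side (the empty pieces before the first token and after the last one are not NA, so na.rm = TRUE does not drop them); wrapping the united column in trimws() should remove them if that matters.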
