Reputation: 531
I have a data frame whose text column holds previously tokenized words, as shown below. Replication code:
df <- data.frame (id = c("1", "2","3"),
text = c("['I', 'like', 'apple']", "['we', 'go', 'swimming']", "['ask', 'questions']")
)
This produces:
id text
1 ['I', 'like', 'apple']
2 ['we', 'go', 'swimming']
3 ['ask', 'questions']
The original data frame was obtained in Python after preprocessing (including tokenizing) raw text data.
I'd like to merge these tokens back into a sentence so that it looks like this:
id text
1 I like apple
2 we go swimming
3 ask questions
I tried using the paste() function, df$text_new <- paste(df$text, sep = " "), but it did not work and still returned the same result.
Upvotes: 1
Views: 336
Reputation: 1038
You can separate() and then unite() the tokens with tidyr. You will have to provide into = with a character vector long enough for each word in the longest sentence (I used letters to get 26 column names), and then refer to the first and last of those columns (a:z) in unite().
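library(tidyr)
df <- data.frame(id = c("1", "2", "3"),
                 text = c("['I', 'like', 'apple']", "['we', 'go', 'swimming']", "['ask', 'questions']"))
df %>%
  # split on runs of non-alphanumeric characters (the default sep); pad short rows with NA
  separate(text, into = letters, fill = "right") %>%
  # join the pieces back together with spaces, dropping the NA padding
  unite(text, a:z, sep = " ", na.rm = TRUE)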
#> id text
#> 1 1 I like apple
#> 2 2 we go swimming
#> 3 3 ask questions
Created on 2022-05-26 by the reprex package (v2.0.1)
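As a possible alternative in base R, here is a minimal sketch. It assumes every token is wrapped in single quotes and contains no apostrophe itself, as in the example data. paste() with only sep leaves the column unchanged because each element of df$text is already a single string, so there is nothing to join. The sketch instead extracts the quoted tokens with regmatches() and joins them per row with collapse. The column name text_new is the one used in the question.

```r
# Same example data as in the question
df <- data.frame(
  id = c("1", "2", "3"),
  text = c("['I', 'like', 'apple']", "['we', 'go', 'swimming']", "['ask', 'questions']")
)

# Pull out every single-quoted token per row (assumes no apostrophes inside tokens)
tokens <- regmatches(df$text, gregexpr("'[^']*'", df$text))

# Strip the surrounding quotes and join each row's tokens with a space
df$text_new <- vapply(
  tokens,
  function(tok) paste(gsub("'", "", tok), collapse = " "),
  character(1)
)

# df$text_new now holds: "I like apple", "we go swimming", "ask questions"
```

This does not need a fixed number of columns, so it should also cope with sentences longer than 26 words. With the separate()/unite() approach it may be worth checking the result for leading or trailing spaces, since the strings start and end with separator characters; trimws() would remove them.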
Upvotes: 1