Reputation: 37
Imagine I have the following tibble df:
id   doc                     doc_word_count
-------------------------------------------
1    Lorem ipsum dolor...    1439
2    Lorem ipsum dolor...    10234
3    Lorem ipsum dolor...    2000
4    Lorem ipsum dolor...    15034
5    Lorem ipsum dolor...    11000
where doc_word_count measures the number of words in doc. What I would like to do is split the doc column into rows of 500 words each (this number is arbitrary). The new tibble df_split should look something like this:
id   doc                     doc_word_count
-------------------------------------------
1    Lorem ipsum dolor...    500
1    labore et dolore...     500
1    totam rem aperiam...    439
2    ...                     500
...  ...                     500
...  ...                     ...
If fewer than 500 words remain for the last chunk, it should just store however many words are left. I have looked at str_split and this StackOverflow post, but neither seems relevant here because I am not splitting on a pattern or a fixed character width.
Upvotes: 0
Views: 55
Reputation: 131
You can use tidytext::unnest_tokens(), which extracts the words from a string and reshapes the data frame to one word per row. From there, you can use the integer-division operator %/% to assign each word to a chunk of 500 and then recombine each chunk's words into a single string.
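As a quick illustration of the grouping step, here is a minimal sketch of what %/% does to word positions. The seven-word document and the chunk size of 3 (standing in for 500) are made-up values chosen only to keep the result short.

```r
# Positions of seven words in one document (illustrative size, not from the post)
word_position <- 1:7
chunk_size <- 3  # stands in for 500

# Subtract 1 so that positions 1..chunk_size all land in chunk 0
chunk_id <- (word_position - 1) %/% chunk_size
print(chunk_id)
#> [1] 0 0 0 1 1 1 2
```

Without the `- 1`, the word at position 3 would already fall into chunk 1, so the first chunk would come out one word short.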
suppressPackageStartupMessages({
  library(dplyr)
  library(tidytext)
  library(stringi)
  library(stringr)
})
# Example data: random lorem ipsum text, truncated to a known number of words
df <- tibble::tribble(
  ~id, ~doc, ~doc_word_count,
  1, stringr::word(paste0(stringi::stri_rand_lipsum(1000), collapse = ' '), start = 1, end = 1439), 1439,
  2, stringr::word(paste0(stringi::stri_rand_lipsum(1000), collapse = ' '), start = 1, end = 10234), 10234,
  3, stringr::word(paste0(stringi::stri_rand_lipsum(1000), collapse = ' '), start = 1, end = 2000), 2000
)
head(df)
#> # A tibble: 3 x 3
#>      id doc                                                       doc_word_count
#>   <dbl> <chr>                                                              <dbl>
#> 1     1 Lorem ipsum dolor sit amet, litora sollicitudin enim eu.~           1439
#> 2     2 Lorem ipsum dolor sit amet, sed viverra amet velit ut ve~          10234
#> 3     3 Lorem ipsum dolor sit amet, auctor convallis tristique v~           2000
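df_split <- df %>%
  # One row per word (note: lowercases and strips punctuation by default)
  tidytext::unnest_tokens(word, doc) %>%
  dplyr::group_by(id) %>%
  # Words 1-500 of each document get chunk 0, words 501-1000 get chunk 1, ...
  dplyr::mutate(new_grp = (row_number() - 1) %/% 500) %>%
  dplyr::group_by(id, new_grp) %>%
  # Collapse each chunk back into a single string and count its words
  dplyr::summarize(doc_word_count = n(),
                   doc = paste0(word, collapse = ' ')) %>%
  dplyr::ungroup() %>%
  dplyr::select(id, doc, doc_word_count)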
#> `summarise()` regrouping output by 'id' (override with `.groups` argument)
head(df_split)
#> # A tibble: 6 x 3
#>      id doc                                                       doc_word_count
#>   <dbl> <chr>                                                              <int>
#> 1     1 lorem ipsum dolor sit amet litora sollicitudin enim eu i~            500
#> 2     1 semper ullamcorper fames congue metus elementum condimen~            500
#> 3     1 tincidunt magnis vehicula amet elementum quisque eu vita~            439
#> 4     2 lorem ipsum dolor sit amet sed viverra amet velit ut vel~            500
#> 5     2 non arcu netus aptent imperdiet lobortis eros in nulla i~            500
#> 6     2 sem amet mattis sed feugiat ut arcu amet sed pellentesqu~            500
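One caveat visible in the output above: unnest_tokens() lowercases the text and drops punctuation by default, so the chunks are not verbatim slices of the original documents. If that matters, a base R sketch along the following lines may work instead. The helper name chunk_words, the definition of a word as a whitespace-separated token, and the tiny df_small example are all assumptions made for illustration; they are not part of the approach above, and whitespace-based counts may differ slightly from however doc_word_count was originally computed.

```r
# Hypothetical helper: split one string into chunks of at most `n`
# whitespace-separated words, keeping case and punctuation intact.
chunk_words <- function(text, n = 500) {
  words <- strsplit(trimws(text), "\\s+")[[1]]
  # Integer division sends words 1..n to chunk 0, n+1..2n to chunk 1, ...
  chunk_id <- (seq_along(words) - 1) %/% n
  unname(vapply(split(words, chunk_id), paste, character(1), collapse = " "))
}

# Small made-up data, chunked at 3 words so the result is easy to inspect
df_small <- data.frame(
  id = c(1, 2),
  doc = c("Lorem ipsum dolor sit amet, consectetur adipiscing",
          "Sed ut perspiciatis"),
  stringsAsFactors = FALSE
)

chunks <- lapply(df_small$doc, chunk_words, n = 3)

# Repeat each id once per chunk and flatten the chunks into rows
df_small_split <- data.frame(
  id = rep(df_small$id, lengths(chunks)),
  doc = unlist(chunks),
  stringsAsFactors = FALSE
)
df_small_split$doc_word_count <- lengths(strsplit(df_small_split$doc, "\\s+"))

print(df_small_split)
```

Calling the helper with n = 500 on the real doc column would follow the same pattern; the last chunk of each document simply holds whatever words remain.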
Upvotes: 1