Reputation: 11
I am looking at a novel and want to search for the appearance of characters' names throughout the book. Some characters go by different names. For example, the character "Sissy Jupe" goes by "Sissy" and "Jupe". I want to combine two rows of word counts into one so I can see the tally for "Sissy Jupe".
I've tried sum, rbind, merge, and other approaches found on the message boards. There are lots of great examples, but none of them work for my case.
library(tidyverse)
library(gutenbergr)
library(tidytext)
ht <- gutenberg_download(786)
ht_chap <- ht %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE))))
tidy_ht <- ht_chap %>%
  unnest_tokens(word, text) %>%
  mutate(word = str_extract(word, "[a-z']+")) # keeps only letters and apostrophes; drops underscores
ht_count <- tidy_ht %>%
  group_by(chapter) %>%
  count(word, sort = TRUE) %>%
  ungroup() %>%
  complete(chapter, word,
           fill = list(n = 0))
gradgrind <- filter(ht_count, word == "gradgrind")
bounderby <- filter(ht_count, word == "bounderby")
sissy <- filter(ht_count, word == "sissy")
## TEST
sissy_jupe <- ht_count %>%
  filter(word %in% c("sissy", "jupe"))
I want a single "word" item called "sissy_jupe" that tallies n by chapter. The result below is close, but it still keeps the two names on separate rows:
# A tibble: 76 x 3
   chapter word      n
     <int> <chr> <dbl>
 1       0 jupe      0
 2       0 sissy     1
 3       1 jupe      0
 4       1 sissy     0
 5       2 jupe      5
 6       2 sissy     9
 7       3 jupe      3
 8       3 sissy     1
 9       4 jupe      1
10       4 sissy     0
# … with 66 more rows
Upvotes: 1
Views: 380
Reputation: 1456
Welcome to Stack Overflow, Tom. Here's an idea:
Basically, (1) find "sissy" or "jupe" in the tidied tibble and replace it with "sissy_jupe", (2) create ht_count as you did, (3) filter for "sissy_jupe" and print the results:
library(tidyverse)
library(gutenbergr)
library(tidytext)
ht <- gutenberg_download(786)
ht_chap <- ht %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE))))
tidy_ht <- ht_chap %>%
  unnest_tokens(word, text) %>%
  mutate(word = str_extract(word, "[a-z']+")) # keeps only letters and apostrophes; drops underscores
# NEW CODE START
tidy_ht <- tidy_ht %>%
  mutate(word = str_replace_all(word, "sissy|jupe", replacement = "sissy_jupe"))
# END NEW CODE
ht_count <- tidy_ht %>%
  group_by(chapter) %>%
  count(word, sort = TRUE) %>%
  ungroup() %>%
  complete(chapter, word,
           fill = list(n = 0))
# NEW CODE
sissy_jupe <- ht_count %>%
  filter(str_detect(word, "sissy_jupe"))
# END
# END
... produces ...
# A tibble: 38 x 3
   chapter word           n
     <int> <chr>      <dbl>
 1       0 sissy_jupe     1
 2       1 sissy_jupe     0
 3       2 sissy_jupe    14
 4       3 sissy_jupe     4
 5       4 sissy_jupe     1
 6       5 sissy_jupe     5
 7       6 sissy_jupe    20
 8       7 sissy_jupe     7
 9       8 sissy_jupe     2
10       9 sissy_jupe    38
# ... with 28 more rows
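If more characters with several names need the same treatment, one possible variation is to keep the aliases in a small lookup table and join it onto the tokens, which also matches whole words only rather than substrings. This is a minimal sketch on made-up data: toy_tokens, aliases, label, and toy_count are hypothetical names, and toy_tokens merely stands in for tidy_ht.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for tidy_ht: one row per token.
toy_tokens <- tribble(
  ~chapter, ~word,
  1L, "sissy",
  1L, "jupe",
  1L, "gradgrind",
  2L, "sissy",
  2L, "bounderby"
)

# Hypothetical alias table: each alias maps to one combined label.
aliases <- tribble(
  ~word,   ~label,
  "sissy", "sissy_jupe",
  "jupe",  "sissy_jupe"
)

toy_count <- toy_tokens %>%
  left_join(aliases, by = "word") %>%      # label is NA for words without an alias
  mutate(word = coalesce(label, word)) %>% # fall back to the original word
  count(chapter, word)                     # one row per chapter/word pair

print(toy_count)
```

Adding another character then only means adding rows to aliases; the counting pipeline stays the same.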
Don't forget to upvote / click on the checkmark if any of our solutions helped you (feedback = better coders).
Upvotes: 0
Reputation: 575
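The code below should get you the needed output. Here df is assumed to be the filtered tibble from the question (the rows for "sissy" and "jupe").
library(tidyverse)
df %>%
  group_by(chapter) %>%
  mutate(n = sum(n),                          # total of both names within each chapter
         word = paste(word, collapse = "_")) %>%  # join the names found in each chapter
  distinct(chapter, .keep_all = TRUE)         # keep one row per chapter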
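The same per-chapter total can also be written with summarise, which collapses each chapter to one row directly. The sketch below is only an illustration: toy_df and toy_total are hypothetical names, and the counts are the chapter 2 and chapter 3 rows from the output in the question.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for the filtered tibble from the question.
toy_df <- tribble(
  ~chapter, ~word,   ~n,
  2L,       "jupe",  5,
  2L,       "sissy", 9,
  3L,       "jupe",  3,
  3L,       "sissy", 1
)

toy_total <- toy_df %>%
  group_by(chapter) %>%
  summarise(word = "sissy_jupe",  # fixed label for the combined rows
            n = sum(n),           # add the counts of both names
            .groups = "drop")     # return an ungrouped tibble

print(toy_total)
```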
Upvotes: 1