Tom Liam Lynch

Reputation: 11

How to Combine Multiple Rows Into One Using TidyText

I am analyzing a novel and want to count how often each character's name appears throughout the book. Some characters go by different names. For example, the character "Sissy Jupe" goes by both "Sissy" and "Jupe". I want to combine the two rows of word counts into one so I can see a single tally for "Sissy Jupe".

I've tried sum, rbind, merge, and other approaches suggested on the message boards. There are lots of great examples, but none of them gives me the result I need.

library(tidyverse) 
library(gutenbergr)
library(tidytext)

ht <- gutenberg_download(786)

ht_chap <- ht %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE))))

tidy_ht <- ht_chap %>%
  unnest_tokens(word, text) %>%
  mutate(word = str_extract(word, "[a-z']+")) # keeps only letters and apostrophes; removes underscores

ht_count <- tidy_ht %>%
  group_by(chapter) %>%
  count(word, sort = TRUE) %>%
  ungroup() %>%
  complete(chapter, word,
           fill = list(n = 0)) 

gradgrind <- filter(ht_count, word == "gradgrind")
bounderby <- filter(ht_count, word == "bounderby")
sissy <- filter(ht_count, word == "sissy")

## TEST
sissy_jupe <- ht_count %>% 
  filter(word %in% c("sissy", "jupe"))

I want a single "word" item called "sissy_jupe" that tallies n by chapter. The test above gets close, but it still lists "sissy" and "jupe" on separate rows:

# A tibble: 76 x 3
   chapter word      n
     <int> <chr> <dbl>
 1       0 jupe      0
 2       0 sissy     1
 3       1 jupe      0
 4       1 sissy     0
 5       2 jupe      5
 6       2 sissy     9
 7       3 jupe      3
 8       3 sissy     1
 9       4 jupe      1
10       4 sissy     0
# … with 66 more rows

Upvotes: 1

Views: 380

Answers (2)

Marian Minar

Reputation: 1456

Welcome to Stack Overflow, Tom. Here's an idea:

Basically: (1) find "sissy" or "jupe" in the tidied tibble and replace it with "sissy_jupe", (2) create ht_count as you did, and (3) print the results:

library(tidyverse) 
library(gutenbergr)
library(tidytext)

ht <- gutenberg_download(786)

ht_chap <- ht %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE))))

tidy_ht <- ht_chap %>%
  unnest_tokens(word, text) %>%
  mutate(word = str_extract(word, "[a-z']+")) # keeps only letters and apostrophes; removes underscores

# NEW CODE START
tidy_ht <- tidy_ht %>%
  mutate(word = str_replace_all(word, "sissy|jupe", replacement = "sissy_jupe"))
# END NEW CODE

ht_count <- tidy_ht %>%
  group_by(chapter) %>%
  count(word, sort = TRUE) %>%
  ungroup() %>%
  complete(chapter, word,
           fill = list(n = 0))

# NEW CODE
sissy_jupe <- ht_count %>% 
  filter(str_detect(word, "sissy_jupe"))
# END

... produces ...

# A tibble: 38 x 3
   chapter word           n
     <int> <chr>      <dbl>
 1       0 sissy_jupe     1
 2       1 sissy_jupe     0
 3       2 sissy_jupe    14
 4       3 sissy_jupe     4
 5       4 sissy_jupe     1
 6       5 sissy_jupe     5
 7       6 sissy_jupe    20
 8       7 sissy_jupe     7
 9       8 sissy_jupe     2
10       9 sissy_jupe    38
# ... with 28 more rows
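
One caveat with the regex replacement: str_replace_all rewrites the pattern wherever it occurs inside a token, so a token such as "sissy's" would become "sissy_jupe's". If that matters for your counts, an exact match is one alternative. Here is a minimal sketch on made-up data; toy_tidy and aliases are hypothetical names standing in for tidy_ht and your list of names, and the rows are not taken from the novel:

```r
library(dplyr)
library(tibble)

# Toy stand-in for tidy_ht (hypothetical rows, not real text from the book)
toy_tidy <- tibble(
  chapter = c(1, 1, 1, 2, 2, 2),
  word    = c("sissy", "jupe", "sissy's", "jupe", "gradgrind", "sissy")
)

# Names to merge into one label (assumed list; extend as needed)
aliases <- c("sissy", "jupe")

toy_count <- toy_tidy %>%
  # Exact match only: tokens such as "sissy's" are left untouched
  mutate(word = if_else(word %in% aliases, "sissy_jupe", word)) %>%
  count(chapter, word)

# One row per chapter for the combined name
sissy_jupe_toy <- filter(toy_count, word == "sissy_jupe")
print(sissy_jupe_toy)
```

Whether possessives should count toward the tally is up to you; the regex version above folds them in under a separate label, while this version leaves them as their own tokens.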

Don't forget to upvote / click on the checkmark if any of our solutions helped you (feedback = better coders).

Upvotes: 0

Theo

Reputation: 575

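The code below should give you the output you need, assuming df is the filtered tibble from your question (the one containing only the "sissy" and "jupe" rows).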

library(tidyverse)
df %>% group_by(chapter) %>% 
  mutate(n = sum(n),
         word = paste(word, collapse="_")) %>% 
  distinct(chapter, .keep_all = TRUE)
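
Note that with the rows ordered as in the question, paste() builds the label from the words in each group, so it comes out as "jupe_sissy" rather than "sissy_jupe". If you want to choose the label yourself, a summarise-based variant is one option. This is only a sketch: toy_filtered is a hypothetical stand-in for the filtered tibble, using the first six rows shown in the question.

```r
library(dplyr)
library(tibble)

# Toy stand-in for the filtered tibble (chapters 0 to 2 from the question's output)
toy_filtered <- tibble(
  chapter = c(0L, 0L, 1L, 1L, 2L, 2L),
  word    = c("jupe", "sissy", "jupe", "sissy", "jupe", "sissy"),
  n       = c(0, 1, 0, 0, 5, 9)
)

combined <- toy_filtered %>%
  group_by(chapter) %>%
  summarise(n = sum(n)) %>%          # one total per chapter
  mutate(word = "sissy_jupe") %>%    # set the combined label explicitly
  select(chapter, word, n)

print(combined)
```

For these three chapters the totals are 1, 0, and 14, which matches adding the "jupe" and "sissy" rows in the question by hand.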

Upvotes: 1
