Seyma Kalay

Reputation: 2859

Counting strings in R

I have a dataset as below. I would like to group by journal (SO) and then count how many times each string in AU_UN occurs within each group. Many thanks in advance.

SO = c("Journal Of Business", "Journal Of Business", "Journal of Economy")

AU_UN = c("Dartmouth Coll;Wellesley Coll;Wellesley Coll",
          "Georgetown Univ;Fed Reserve Syst",
          "Georgetown Univ;Fed Reserve Syst")

df <- data.frame(SO, AU_UN)
df

Expected Answer

Journal Of Business      Dartmouth Coll (1);Wellesley Coll (2);Georgetown Univ (1);Fed Reserve Syst (1)
Journal of Economy       Georgetown Univ (1);Fed Reserve Syst (1)

Upvotes: 2

Views: 85

Answers (2)

G. Grothendieck

Reputation: 269654

Use separate_rows to convert to long form (one row per affiliation), count the rows with count, and convert back to one row per journal with summarize.

library(dplyr)
library(tidyr)

df %>% 
  separate_rows(AU_UN, sep = ";") %>% 
  count(SO, AU_UN) %>% 
  group_by(SO) %>% 
  summarize(AU_UN = paste(sprintf("%s (%d)", AU_UN, n), collapse=";"), .groups = "drop")

giving:

# A tibble: 2 x 2
  SO                  AU_UN
  <chr>               <chr>
1 Journal Of Business Dartmouth Coll (1);Fed Reserve Syst (1);Georgetown Univ (1);Wellesley Coll (2)
2 Journal of Economy  Fed Reserve Syst (1);Georgetown Univ (1)
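
Note that count() sorts the affiliations alphabetically, whereas the expected answer in the question lists them in order of first appearance. If that order matters, one possible variation (a sketch, not part of the original answer) is to turn AU_UN into a factor whose levels follow first appearance before counting. The name `res` is just illustrative, and the levels here are taken across the whole data set, not per journal.

```r
library(dplyr)
library(tidyr)

SO <- c("Journal Of Business", "Journal Of Business", "Journal of Economy")
AU_UN <- c("Dartmouth Coll;Wellesley Coll;Wellesley Coll",
           "Georgetown Univ;Fed Reserve Syst",
           "Georgetown Univ;Fed Reserve Syst")
df <- data.frame(SO, AU_UN, stringsAsFactors = FALSE)

res <- df %>%
  separate_rows(AU_UN, sep = ";") %>%
  # Factor levels follow first appearance, so count() orders rows by level
  # instead of alphabetically (assumption: a global order is acceptable)
  mutate(AU_UN = factor(AU_UN, levels = unique(AU_UN))) %>%
  count(SO, AU_UN) %>%
  group_by(SO) %>%
  # Convert the factor back to character before formatting "name (count)"
  summarize(AU_UN = paste(sprintf("%s (%d)", as.character(AU_UN), n),
                          collapse = ";"),
            .groups = "drop")

res
```

With the sample data this should list Dartmouth Coll and Wellesley Coll before Georgetown Univ and Fed Reserve Syst for "Journal Of Business", as in the question.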

Upvotes: 1

Till

Reputation: 6628

Using base::strsplit() we can extract the substrings. strsplit() returns a list with one character vector per row, holding that row's substrings. The resulting list-column (nested column) can be unnested with tidyr::unnest(). To get the frequency of each string for each journal, we use dplyr::count().

library(tidyverse)
df %>% 
  mutate(strings  = strsplit(AU_UN, ";")) %>% 
  unnest(strings) %>% 
  count(SO, strings)
#> # A tibble: 6 x 3
#>   SO                  strings              n
#>   <chr>               <chr>            <int>
#> 1 Journal Of Business Dartmouth Coll       1
#> 2 Journal Of Business Fed Reserve Syst     1
#> 3 Journal Of Business Georgetown Univ      1
#> 4 Journal Of Business Wellesley Coll       2
#> 5 Journal of Economy  Fed Reserve Syst     1
#> 6 Journal of Economy  Georgetown Univ      1
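
To make the strsplit() step more concrete, here is a rough base R sketch of the same idea without the tidyverse. It is only an illustration: the names `parts`, `long` and `counts` are made up for this example, and `fixed = TRUE` is my addition so that ";" is treated as a literal separator.

```r
SO <- c("Journal Of Business", "Journal Of Business", "Journal of Economy")
AU_UN <- c("Dartmouth Coll;Wellesley Coll;Wellesley Coll",
           "Georgetown Univ;Fed Reserve Syst",
           "Georgetown Univ;Fed Reserve Syst")

# strsplit() returns a list: one character vector of substrings per input row
parts <- strsplit(AU_UN, ";", fixed = TRUE)
lengths(parts)  # number of substrings found in each row

# "Unnest" by hand: repeat each journal once per substring, then flatten the list
long <- data.frame(SO = rep(SO, lengths(parts)),
                   strings = unlist(parts),
                   stringsAsFactors = FALSE)

# Cross-tabulate journal by string and keep only combinations that occur
counts <- as.data.frame(table(long), stringsAsFactors = FALSE)
counts <- counts[counts$Freq > 0, ]
counts
```

The `Freq` column here plays the role of `n` in the count() output above.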

Upvotes: 2
