Sarah

Reputation: 463

Creating a table extracting the first letter in a string and counts in R

I am trying to extract the first letter of each comma-separated string in a cell, then count how many times each letter appears. An example of a column in my data frame looks like this:

test <- data.frame("Code" =  c("EKST, STFO", "EFGG", "SSGG, RRRR, RRFK", 
"RRRF"))

And I'd want a column added next to it that looks like this:

test2 <- data.frame("Code" =  c("EKST, STFO", "EFGG", "SSGG, RRRR, RRFK", 
"RRRF"), "Code_Count" = c("E1, S1", "E1", "S1, R2", "R1"))

The code count column extracts the first letter of the string and counts how many times that letter appears in that specific cell.

I looked into using strsplit to get the first letter in the column separated by commas, but I'm not sure how to attach the count of how many times that letter appears in the cell to it.

Upvotes: 2

Views: 675

Answers (1)

Andrew

Reputation: 5138

Here is one option using base R. This splits the Code column on the comma (and at least one space), then tabulates the number of times the first letter appears, then pastes them back together into your desired output. It does sort the new column in alphabetical order (which doesn't match your output). Hope this helps!

test2$Code_Count2 <- sapply(strsplit(test2$Code, ",\\s+"), function(x) {
  tab <- table(substr(x, 1, 1)) # Create a table of the first letters
  paste0(names(tab), tab, collapse = ", ") # Paste together the letter w/ the number and collapse them
} )

test2
              Code Code_Count Code_Count2
1       EKST, STFO     E1, S1      E1, S1
2             EFGG         E1          E1
3 SSGG, RRRR, RRFK     S1, R2      R2, S1
4             RRRF         R1          R1
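
If the order of first appearance matters (as in the desired "S1, R2"), one possible tweak is to tabulate a factor whose levels follow the order in which the letters occur. This is only a sketch: the helper name count_first_letters and the standalone vector codes are made up for illustration and are not part of the original answer.

```r
# Sample data copied from the question
codes <- c("EKST, STFO", "EFGG", "SSGG, RRRR, RRFK", "RRRF")

# Hypothetical helper: count first letters, keeping order of first appearance
count_first_letters <- function(x) {
  first <- substr(x, 1, 1)
  # Levels follow first appearance, so table() does not re-sort alphabetically
  tab <- table(factor(first, levels = unique(first)))
  paste0(names(tab), tab, collapse = ", ")
}

# Split each cell on a comma plus whitespace, then apply the helper
code_count <- vapply(strsplit(codes, ",\\s+"), count_first_letters, character(1))
code_count
# [1] "E1, S1" "E1"     "S1, R2" "R1"
```

vapply() is used here instead of sapply() only so that the result is guaranteed to be a character vector.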

Here is a tidier stringr/purrr solution that extracts the first letter of each word directly (instead of splitting the string) and then tabulates it the same way:

library(purrr)
library(stringr)

map_chr(str_extract_all(test2$Code, "\\b[A-Z]{1}"), function(x) {
  tab <- table(x)
  paste0(names(tab), tab, collapse = ", ")
  } )

Data:

test2 <- data.frame("Code" =  c("EKST, STFO", "EFGG", "SSGG, RRRR, RRFK", 
                            "RRRF"), "Code_Count" = c("E1, S1", "E1", "S1, R2", "R1"))
test2[] <- lapply(test2, as.character) # factor to character
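
The factor-to-character conversion matters because strsplit() only accepts character vectors, and before R 4.0.0 data.frame() turned strings into factors by default. As a small sketch (the data frame df below is made up for illustration), wrapping the column in as.character() at the call site should work either way:

```r
# stringsAsFactors = TRUE mimics the pre-4.0.0 default of data.frame()
df <- data.frame(Code = c("EKST, STFO", "EFGG"), stringsAsFactors = TRUE)
is.factor(df$Code)

# strsplit() errors on a factor, so convert explicitly before splitting
parts <- strsplit(as.character(df$Code), ",\\s+")
parts
```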

Upvotes: 4
