user2716568

Reputation: 1946

How do I count the frequency of a character within a string, by a group?

My data.frame contains information on the movements completed by an individual and a string (of alpha characters) that represents these movements in a database. It is structured as follows:

MovementAnalysis <- structure(list(
  Strings = c("AaB", "cZhH", "Bb", "bAc"),
  Descriptor = c("Jog/ Stop/ Turn", "Change/ Shuffle/ Backwards/ Jump", "Turn/ Duck", "Duck/ Jog/ Change"),
  Person = c("Sally", "Sally", "Ben", "Ben")),
  .Names = c("Strings", "Descriptor", "Person"),
  row.names = c(NA, 4L), class = "data.frame")

I wish to capture the frequency of each letter (for example: A, a, B, b) across all the Strings for each Person. There are 48 upper- and lower-case alpha letters. My actual data.frame contains the movements of 100+ individuals, so a quick solution that iterates over each individual would be ideal. As an example, my anticipated output would be:

Output <- structure(list(
  Person = c("Sally", "Sally", "Sally", "Sally", "Ben", "Ben", "Ben", "Ben"),
  Letter = c("A", "a", "B", "b", "A", "a", "B", "b"),
  Frequency = c(1, 1, 1, 0, 1, 0, 1, 2)),
  .Names = c("Person", "Letter", "Frequency"),
  row.names = c(NA, 8L), class = "data.frame")

Thank you!

Upvotes: 1

Views: 184

Answers (3)

A5C1D2H2I1M1N2O1R2T1

Reputation: 193687

Here's an option using cSplit_e from my "splitstackshape" package. I've combined it with "magrittr" so that you can walk through the steps without having to store any intermediate objects or create a long nested expression.

The first option shows how to get the "wide" form, as described by @alistaire.

library(splitstackshape)
library(data.table)  # attached explicitly for data.table() and melt()
library(magrittr)

data.table(subset(MovementAnalysis, select = -Descriptor)) %>%
  cSplit_e("Strings", "", type = "character", drop = TRUE, fill = 0) %>%
  .[, lapply(.SD, sum), by = Person] %>%
  subset(select = grep("Person|_[AaBb]$", names(.)))
#    Person Strings_a Strings_A Strings_b Strings_B
# 1:  Sally         1         1         0         1
# 2:    Ben         0         1         2         1

To go from the above to the long form, you just need to add a melt line.

data.table(subset(MovementAnalysis, select = -Descriptor)) %>%
  cSplit_e("Strings", "", type = "character", drop = TRUE, fill = 0) %>%
  .[, lapply(.SD, sum), by = Person] %>%
  subset(select = grep("Person|_[AaBb]$", names(.))) %>%
  melt(id.vars = "Person")
#    Person  variable value
# 1:  Sally Strings_a     1
# 2:    Ben Strings_a     0
# 3:  Sally Strings_A     1
# 4:    Ben Strings_A     1
# 5:  Sally Strings_b     0
# 6:    Ben Strings_b     2
# 7:  Sally Strings_B     1
# 8:    Ben Strings_B     1  
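
To get column names closer to the Person/Letter/Frequency layout asked for in the question, the melt step can name its output columns, and the "Strings_" prefix can be stripped afterwards. This is only a sketch: the `wide` table below is a hand-built stand-in for the wide result shown earlier (so the example runs without splitstackshape), and the names `wide` and `long` are illustrative.

```r
library(data.table)

# Hand-built stand-in for the wide result shown above (an assumption,
# so that this sketch does not depend on cSplit_e)
wide <- data.table(
  Person    = c("Sally", "Ben"),
  Strings_a = c(1, 0),
  Strings_A = c(1, 1),
  Strings_b = c(0, 2),
  Strings_B = c(1, 1)
)

# Name the melted columns directly instead of the default variable/value
long <- melt(wide, id.vars = "Person",
             variable.name = "Letter", value.name = "Frequency")

# Drop the "Strings_" prefix so that only the letter itself remains
long[, Letter := sub("^Strings_", "", Letter)]

print(long)
```

The same `variable.name` and `value.name` arguments could be added to the `melt(id.vars = "Person")` call in the pipeline.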

It's not clear from your question, but if your restricting the data to just "A", "a", "B", and "b" was just for the purpose of illustration and you're actually interested in the full 48 options, then you can also omit the following line:

subset(select = grep("Person|_[AaBb]$", names(.)))

Upvotes: 0

Alexandre Halm

Reputation: 989

Less wizardry than akrun's answer, but I think it works:

library(dplyr)

your.func <- function(data) {
    # Count every letter across all of one person's strings
    bag.of.letters <- function(strings) {
        concat.string <- paste(strings, collapse = "")
        all.chars.vec <- unlist(strsplit(concat.string, ""))
        # Fixing the levels keeps letters that never occur, with a count of 0
        result <- data.frame(table(factor(all.chars.vec, levels = c(letters, LETTERS))))
        colnames(result) <- c("Letter", "Frequency")
        result[order(result[["Letter"]]), ]
    }
    # Build one block of counts per person, then stack the blocks
    lapply(X = unique(data[["Person"]]),
           FUN = function(n) {
               strings <- data %>% filter(Person == n) %>% .[["Strings"]]
               data.frame(Person = n, bag.of.letters(strings))
           }) %>% do.call(rbind, .)
}

your.func(MovementAnalysis)

If you want to have only letters with positive Frequency in your Letter column, remove the factor(..., levels = c(letters,LETTERS)) part.
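
A small base R sketch of why that works: `table()` reports a count for every factor level, so fixing the levels to `c(letters, LETTERS)` forces zero-count rows, while tabulating the raw characters only reports letters that actually occur. The vector `chars` and the two result names are made up for illustration.

```r
# Characters from a single made-up string, "bAcb"
chars <- unlist(strsplit("bAcb", ""))

# With fixed levels: one entry per letter in c(letters, LETTERS), zeros included
with_levels <- table(factor(chars, levels = c(letters, LETTERS)))

# Without fixed levels: only the letters that actually occur
without_levels <- table(chars)

length(with_levels)     # 52 entries, most of them 0
length(without_levels)  # 3 entries: one each for "A", "b" and "c"
```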

Upvotes: 1

akrun

Reputation: 887951

One option is to use data.table:

library(data.table)
# df1 is assumed to be a copy of the question's MovementAnalysis
df1 <- copy(MovementAnalysis)
df2 <- setDT(df1)[,list(Letter={
   tmp <- unlist(strsplit(Strings, ''))
   factor(tmp[tmp %in% c("A", "a", "B", "b")], 
        levels=c("A", "a", "B", "b"))}) , Person]
df2[, ind:="Frequency"]
dcast(df2, Person+Letter~ind, value.var="Letter", length, drop=FALSE)
#   Person Letter Frequency
#1:    Ben      A         1
#2:    Ben      a         0
#3:    Ben      B         1
#4:    Ben      b         2
#5:  Sally      A         1
#6:  Sally      a         1
#7:  Sally      B         1
#8:  Sally      b         0
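
As a cross-check that needs no extra packages, the same counts can be sketched in base R: split each string into characters, repeat each Person once per character, and tabulate the two columns together. This is not the approach used above, just an equivalent formulation; the names `chars`, `long` and `counts` are illustrative, and the data.frame is rebuilt here (without the Descriptor column) so that the block is self-contained.

```r
# The question's data, rebuilt without the Descriptor column
MovementAnalysis <- data.frame(
  Strings = c("AaB", "cZhH", "Bb", "bAc"),
  Person  = c("Sally", "Sally", "Ben", "Ben"),
  stringsAsFactors = FALSE
)

# One character vector per row
chars <- strsplit(MovementAnalysis$Strings, "")

# One row per character; letters outside the chosen levels become NA
long <- data.frame(
  Person = rep(MovementAnalysis$Person, lengths(chars)),
  Letter = factor(unlist(chars), levels = c("A", "a", "B", "b"))
)

# table() drops NA by default and keeps zero counts for unused levels
counts <- as.data.frame(table(long), responseName = "Frequency")
print(counts)
```

Swapping the `levels` vector for `c(letters, LETTERS)` (or whatever set of letters the database actually uses) would extend the count to every letter.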

Upvotes: 1
