Reputation: 1946
My data.frame contains information on the movements completed by an individual and a string (of alpha characters) that represents these movements in a database. It is structured as follows:
MovementAnalysis <- structure(list(Strings = c("AaB", "cZhH", "Bb", "bAc"), Descriptor = c("Jog/ Stop/ Turn", "Change/ Shuffle/ Backwards/ Jump", "Turn/ Duck", "Duck/ Jog/ Change"), Person = c("Sally", "Sally", "Ben", "Ben")), .Names = c("Strings", "Descriptor", "Person"), row.names = c(NA, 4L), class = "data.frame")
I wish to capture the frequency of each alpha letter (for example: A, a, B, b) within all the Strings for each Person. There are 48 alpha upper and lower case letters. My actual data.frame contains the movements of 100+ individuals, so a quick solution to iterate over each individual would be ideal. As an example, my anticipated output would be:
Output <- structure(list(Person = c("Sally", "Sally", "Sally", "Sally", "Ben", "Ben", "Ben", "Ben"), Letter = c("A", "a", "B", "b", "A", "a", "B", "b"), Frequency = c(1, 1, 1, 0, 1, 0, 1, 2)), .Names = c("Person", "Letter", "Frequency"), row.names = c(NA, 8L), class = "data.frame")
Thank you!
Upvotes: 1
Views: 184
Reputation: 193687
Here's an option using cSplit_e from my "splitstackshape" package. I've combined it with "magrittr" so that you can walk through the steps without having to store any intermediate objects or create a long nested expression.
The first option shows how to get the "wide" form, as described by @alistaire.
library(splitstackshape)
library(magrittr)
data.table(subset(MovementAnalysis, select = -Descriptor)) %>%
  cSplit_e("Strings", "", type = "character", drop = TRUE, fill = 0) %>%
  .[, lapply(.SD, sum), by = Person] %>%
  subset(select = grep("Person|_[AaBb]$", names(.)))
# Person Strings_a Strings_A Strings_b Strings_B
# 1: Sally 1 1 0 1
# 2: Ben 0 1 2 1
To go from the above to the long form, you just need to add a melt line.
data.table(subset(MovementAnalysis, select = -Descriptor)) %>%
  cSplit_e("Strings", "", type = "character", drop = TRUE, fill = 0) %>%
  .[, lapply(.SD, sum), by = Person] %>%
  subset(select = grep("Person|_[AaBb]$", names(.))) %>%
  melt(id.vars = "Person")
# Person variable value
# 1: Sally Strings_a 1
# 2: Ben Strings_a 0
# 3: Sally Strings_A 1
# 4: Ben Strings_A 1
# 5: Sally Strings_b 0
# 6: Ben Strings_b 2
# 7: Sally Strings_B 1
# 8: Ben Strings_B 1
It's not clear from your question, but if you restricted the data to just "A", "a", "B", and "b" purely for illustration and you're actually interested in the full 48 options, then you can also omit the following line:
subset(select = grep("Person|_[AaBb]$", names(.)))
Upvotes: 0
Reputation: 989
Less wizardly than akrun's answer, but I think it works:
your.func <- function(data) {
  require(dplyr)
  # Count every letter across all of one person's strings
  bag.of.letters <- function(strings) {
    concat.string <- paste(strings, collapse = '')
    all.chars.vec <- unlist(strsplit(concat.string, ""))
    # Fixed factor levels keep letters that never occur (count 0)
    result <- data.frame(table(factor(all.chars.vec, levels = c(letters, LETTERS))))
    colnames(result) <- c("Letter", "Frequency")
    result[order(result[["Letter"]]), ]
  }
  # Build one block of counts per person, then stack the blocks
  lapply(X = unique(data[["Person"]]),
         FUN = function(n) {
           strings <- data %>% filter(Person == n) %>% .[["Strings"]]
           data.frame(Person = n, bag.of.letters(strings))
         }) %>% do.call(rbind, .)
}
your.func(MovementAnalysis)
If you want to have only letters with positive Frequency in your Letter column, remove the factor(..., levels = c(letters,LETTERS)) part.
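As a rough illustration of that variant, here is a minimal base-R sketch of the inner helper with the factor() wrapper dropped, so that table() only sees letters that actually occur. The name bag.of.letters.nonzero is made up for this example and is not part of the answer's code.

```r
# Hypothetical variant of bag.of.letters: no fixed factor levels,
# so letters with a count of zero never appear in the result.
bag.of.letters.nonzero <- function(strings) {
  all.chars.vec <- unlist(strsplit(paste(strings, collapse = ""), ""))
  result <- as.data.frame(table(all.chars.vec), stringsAsFactors = FALSE)
  colnames(result) <- c("Letter", "Frequency")
  result
}

# Ben's strings from the question
bag.of.letters.nonzero(c("Bb", "bAc"))
```

The row order follows the sort order of table(), which may depend on your locale.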
Upvotes: 1
Reputation: 887951
One option is using data.table
library(data.table)
# Work on a data.table copy of the question's MovementAnalysis
df2 <- as.data.table(MovementAnalysis)[, list(Letter = {
    tmp <- unlist(strsplit(Strings, ''))
    factor(tmp[tmp %in% c("A", "a", "B", "b")],
           levels = c("A", "a", "B", "b"))}), Person]
df2[, ind:="Frequency"]
dcast(df2, Person+Letter~ind, value.var="Letter", length, drop=FALSE)
# Person Letter Frequency
#1: Ben A 1
#2: Ben a 0
#3: Ben B 1
#4: Ben b 2
#5: Sally A 1
#6: Sally a 1
#7: Sally B 1
#8: Sally b 0
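If the four letters were only for illustration, the same "factor with fixed levels" idea extends to the whole alphabet. Below is a minimal base-R sketch of that, not the data.table code above. It assumes the letters of interest are the 52 ASCII letters in c(LETTERS, letters), and the names all.letters, long and counts are invented for the example.

```r
# The question's data (Descriptor column omitted for brevity)
MovementAnalysis <- data.frame(
  Strings = c("AaB", "cZhH", "Bb", "bAc"),
  Person  = c("Sally", "Sally", "Ben", "Ben"),
  stringsAsFactors = FALSE
)

# Assumption: these are the letters you want counted
all.letters <- c(LETTERS, letters)

# One row per (Person, single character)
chars <- strsplit(MovementAnalysis$Strings, "")
long <- data.frame(
  Person = rep(MovementAnalysis$Person, lengths(chars)),
  Letter = factor(unlist(chars), levels = all.letters)
)

# Cross-tabulate; the fixed factor levels keep zero counts
counts <- as.data.frame(table(Person = long$Person, Letter = long$Letter),
                        responseName = "Frequency",
                        stringsAsFactors = FALSE)

# Show just the four letters from the question
counts[counts$Letter %in% c("A", "a", "B", "b"), ]
```

Characters outside all.letters become NA in the factor and are dropped by table(), so adjust that vector if your strings use a different set of symbols.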
Upvotes: 1