Reputation: 2382
I've obtained a dataset that needs to be concatenated/aggregated by specific criteria. I know how to do this for numeric variables, but this time I need to compute something from string vectors. For example, I have:
V1 V2
1 YYY
1 MMMMMM
1 UUUU
2 YY
2 UUU
.
.
.
I am trying to compute the percentage of M and U characters for each value of V1, so my result set would look something like:
V1 V2
1 75%
2 60%
I've been fiddling around with the aggregate function, but I cannot even get it to paste together all the data for each V1, so
aggregate(V1~V2, data=x,FUN=paste(x))
obviously doesn't work for me.
Upvotes: 1
Views: 111
Reputation: 887611
An option with data.table
library(data.table)
setDT(dat)[, list(V2 = sum(nchar(V2) - nchar(gsub("M|U", "", V2))) / sum(nchar(V2))), V1]
# V1 V2
#1: 1 0.7692308
#2: 2 0.6000000
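The call above counts matches by comparing each string's length before and after deleting the matched characters. A minimal base R sketch of that same idea, using the sample data from the question (the names mu_count and mu_share are only illustrative):

```r
# Sample data copied from the question
dat <- data.frame(V1 = c(1, 1, 1, 2, 2),
                  V2 = c("YYY", "MMMMMM", "UUUU", "YY", "UUU"),
                  stringsAsFactors = FALSE)

# Number of M/U characters = total length minus length after deleting M and U
mu_count <- nchar(dat$V2) - nchar(gsub("M|U", "", dat$V2))
mu_count
# [1] 0 6 4 0 3

# Sum matches and total characters within each V1 group, then divide
mu_share <- tapply(mu_count, dat$V1, sum) / tapply(nchar(dat$V2), dat$V1, sum)
mu_share
```

For group 1 this gives 10 matching characters out of 13, and for group 2 it gives 3 out of 5, which agrees with the data.table output.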
Upvotes: 0
Reputation: 17611
Here's a way straight from the original dataset:
library(stringi)
stack(
  tapply(d$V2, d$V1,
         function(ii) sum(stri_count_regex(ii, "M|U")) /
                      sum(stri_count_regex(ii, "."))))
# values ind
#1 0.7692308 1
#2 0.6000000 2
To use the aggregate statement you just need a few changes:
d2 <- aggregate(V2 ~ V1, data=d, function(ii) paste0(ii, collapse=""))
# no packages used in this solution:
d2$V2 <- sapply(
  strsplit(d2$V2, "", perl=TRUE),
  function(ii) sum(grepl("M|U", ii)) / length(ii))
# V1 V2
#1 1 0.7692308
#2 2 0.6000000
Or, with the stri_count function from the stringi package, there's a nice shorter option:
d2 <- aggregate(V2~V1, data=d, function(ii) paste0(ii, collapse=""))
library(stringi)
d2$V2 <- stri_count_regex(d2$V2, "M|U") / nchar(d2$V2)
# V1 V2
#1 1 0.7692308
#2 2 0.6000000
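If the result should be displayed as percentages, as in the question's desired output, the proportions can be formatted with sprintf. This is only a sketch: the data frame res below is a hypothetical stand-in holding the two proportions shown above, and one decimal place is an arbitrary choice.

```r
# Hypothetical stand-in for the aggregated result (10/13 and 3/5 from the sample data)
res <- data.frame(V1 = c(1, 2), V2 = c(10/13, 3/5))

# "%.1f" rounds to one decimal place; "%%" prints a literal percent sign
res$V2 <- sprintf("%.1f%%", 100 * res$V2)
res
#   V1    V2
# 1  1 76.9%
# 2  2 60.0%
```

Note that this turns V2 into a character column, so it is best done as the last step, after any further arithmetic.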
Upvotes: 3
Reputation: 3615
Here's a dplyr and stringr solution:
## Create the sample data
dat <- read.table(text = "V1 V2
1 YYY
1 MMMMMM
1 UUUU
2 YY
2 UUU", header = TRUE, stringsAsFactors = FALSE)
## Load the packages
library("dplyr")
library("stringr")
For each group in V1, calculate the number of M's & U's out of the total number of characters:
dat %>%
  group_by(V1) %>%
  summarize(V2 = sum(str_count(V2, "M|U")) / sum(nchar(V2)))
## Source: local data frame [2 x 2]
## V1 V2
## (int) (dbl)
## 1 1 0.7692308
## 2 2 0.6000000
Upvotes: 4