sdgaw erzswer

Reputation: 2382

Concatenating with the aggregate function in R

I've obtained a dataset that needs to be aggregated by specific criteria. I know how to do this for numeric variables, but this time I need to compute something from string vectors. For example, I have:

V1 V2
1 YYY
1 MMMMMM
1 UUUU
2 YY
2 UUU
.
.
. 

I am trying to compute the percentage of M and U characters for each value of V1, so my result set would look something like:

V1 V2
1  75%
2  60%

I've been fiddling around with the aggregate function, but I cannot even get it to paste together all the data for each V1, so

aggregate(V1~V2, data=x,FUN=paste(x)) 

obviously doesn't work for me.

Upvotes: 1

Views: 111

Answers (3)

akrun

Reputation: 887611

An option with data.table

library(data.table)
# M/U count = characters removed by gsub(); divide by total characters per group
setDT(dat)[, list(V2 = sum(nchar(V2) - nchar(gsub("M|U", "", V2))) / sum(nchar(V2))),
           by = V1]
#   V1        V2
#1:  1 0.7692308
#2:  2 0.6000000
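
The counting trick in that expression can be checked in isolation. Below is a minimal base R sketch using the three group-1 strings from the question; the names x and mu_count are only for illustration:

```r
# The three V2 strings that belong to V1 == 1 in the question
x <- c("YYY", "MMMMMM", "UUUU")

# gsub() deletes every M and U, so the drop in length is the M/U count
mu_count <- nchar(x) - nchar(gsub("M|U", "", x))
mu_count
# [1] 0 6 4

# Proportion of M/U characters across the whole group
sum(mu_count) / sum(nchar(x))
# [1] 0.7692308
```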

Upvotes: 0

Jota

Reputation: 17611

Here's a way straight from the original dataset:

library(stringi)
stack(
tapply(d$V2, d$V1, 
  function(ii) sum(stri_count_regex(ii, "M|U")) / 
               sum(stri_count_regex(ii, "."))))
#     values ind
#1 0.7692308   1
#2 0.6000000   2

To use the aggregate statement you just need a few changes:

d2 <- aggregate(V2 ~ V1, data=d, function(ii) paste0(ii, collapse="")) 

# no packages used in this solution:
d2$V2 <- 
  sapply(
    strsplit(d2$V2, "", perl=TRUE),
    function(ii) sum(grepl("M|U", ii))/length(ii))
#  V1        V2
#1  1 0.7692308
#2  2 0.6000000

Or, with the stri_count_regex function from the stringi package, there's a shorter option:

d2 <- aggregate(V2~V1, data=d, function(ii) paste0(ii, collapse="")) 

library(stringi)
d2$V2 <- stri_count_regex(d2$V2, "M|U") / nchar(d2$V2)
#  V1        V2
#1  1 0.7692308
#2  2 0.6000000

Upvotes: 3

Kara Woo

Reputation: 3615

Here's a dplyr and stringr solution:

## Create the sample data
dat <- read.table(text = "V1 V2
1 YYY
1 MMMMMM
1 UUUU
2 YY
2 UUU", header = TRUE, stringsAsFactors = FALSE)

## Load the packages
library("dplyr")
library("stringr")

For each group in V1, calculate the number of M's & U's out of the total number of characters:

dat %>%
  group_by(V1) %>%
  summarize(V2 = sum(str_count(V2, "M|U")) / sum(nchar(V2)))

## Source: local data frame [2 x 2]

##      V1        V2
##   (int)     (dbl)
## 1     1 0.7692308
## 2     2 0.6000000
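
If the result should be displayed as percentages, as in the question's desired output, the proportions can be formatted with base R's sprintf. A small sketch, assuming the two proportions computed above are stored in a hypothetical data frame called res:

```r
# Proportions from the result above: 10/13 and 3/5
res <- data.frame(V1 = 1:2, V2 = c(10 / 13, 3 / 5))

# "%%" prints a literal percent sign; "%.0f" rounds to whole percentages
res$V2 <- sprintf("%.0f%%", 100 * res$V2)
res$V2
# [1] "77%" "60%"
```

Note that this turns V2 into a character column, so it is best done as a last step, after any further numeric work.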

Upvotes: 4
