erc

Reputation: 10123

Apply function only to certain level of factor?

I have a data frame like so:

df <- structure(list(year = c(1990, 1990, 1990, 1990, 1990, 1990, 1990, 
1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1991, 1991, 1991, 
1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 
1991), group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"), 
    value = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 
    13L, 14L, 15L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 
    15L, 16L, 17L, 18L, 19L)), .Names = c("year", "group", "value"
), row.names = c(NA, -30L), class = "data.frame")


   > df
   year group value
1  1990     A     1
2  1990     A     2
3  1990     A     3
4  1990     A     4
5  1990     A     5
6  1990     A     6
7  1990     B     7
8  1990     B     8
9  1990     B     9
10 1990     B    10
11 1990     B    11
12 1990     B    12
13 1990     B    13
14 1990     B    14
15 1990     B    15
16 1991     A     5
17 1991     A     6
18 1991     A     7
19 1991     A     8
20 1991     A     9
21 1991     A    10
22 1991     A    11
23 1991     A    12
24 1991     A    13
25 1991     A    14
26 1991     B    15
27 1991     B    16
28 1991     B    17
29 1991     B    18
30 1991     B    19

I need to apply a function for each year (I intend to do that with plyr and summarise) but only on the factor level with the most rows (A or B). Is there a way to automatically select this level (A or B) for each year?

df2 <- ddply(df, .(year), summarise, result="some operation on longest level")

desired output:

> df2
   year group value result
1  1990     B     7     5
2  1990     B     8     4
3  1990     B     9     5
4  1990     B    10     3
5  1990     B    11     3
6  1990     B    12     8
7  1990     B    13    11
8  1990     B    14     7  
9  1990     B    15     2
10 1991     A     5    10
11 1991     A     6    13
12 1991     A     7     9
13 1991     A     8     7
14 1991     A     9     6
15 1991     A    10     1
16 1991     A    11    15 
17 1991     A    12     5
18 1991     A    13     5
19 1991     A    14     2

Upvotes: 3

Views: 411

Answers (4)

talat

Reputation: 70256

This might be another approach, using dplyr:

library(dplyr)

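# count rows per year/group (%>% replaces the older %.% pipe, which dplyr no longer provides)
df <- df %>% group_by(year, group) %>% mutate(count = n()) %>% ungroup()
# within each year, keep only the group(s) with the highest count, then apply the function
df <- df %>% group_by(year) %>% filter(count %in% max(count)) %>% mutate(result = sqrt(value))
df$count <- NULL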

Since I am not sure what function you want to apply to get result, I used sqrt(value) as in @rbatt's answer.
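
For comparison, here is a minimal base-R sketch of the same count-then-filter idea, in case dplyr is not available. It assumes the question's data (rebuilt compactly here), uses `ave()` to attach the counts, and keeps `sqrt(value)` as the placeholder operation. The names `n_rows`, `n_max` and `res` are illustrative only.

```r
# Rebuild the question's data frame compactly (same rows as the dput output)
df <- data.frame(
  year  = rep(c(1990, 1991), each = 15),
  group = factor(rep(c("A", "B", "A", "B"), times = c(6, 9, 10, 5))),
  value = c(1:15, 5:19)
)

# Number of rows in each (year, group) combination, repeated on every row
n_rows <- ave(seq_along(df$value), df$year, df$group, FUN = length)

# Largest such count within each year, repeated on every row of that year
n_max <- ave(n_rows, df$year, FUN = max)

# Keep only the rows of each year's largest group, then apply the placeholder
res <- df[n_rows == n_max, ]
res$result <- sqrt(res$value)
```

As with the `%in% max(count)` filter, a tie between levels within a year would keep both levels.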

Upvotes: 3

MrFlick

Reputation: 206197

Sorry, I don't use plyr myself, but here's how I might do it with base functions. Perhaps that will inspire a plyr solution for you.

#find largest groups for each year
maxgroups <- tapply(df$group, df$year, function(x) which.max(table(x)))
#create group names
maxpairs <- paste(names(maxgroups),levels(df$group)[maxgroups], sep=".")

#helper function
ifnotin <- function(val, set, ifnotin) {
  out <- val
  out[!val %in% set] <- ifnotin
  droplevels(out)
}
#new factor indicating best group
tgroups <- ifnotin(interaction(df$year, df$group), maxpairs, NA)

#now transform the best groups by adding year to value (or whatever transformation you need to do)
transform(df, value=ifelse(!is.na(tgroups), value+year, value))

I wasn't sure whether your transformation needs to know which group/year it is for. If you just need to know whether a row is in a group that needs transformation, you could skip tgroups and just use

needstransform <- interaction(df$year, df$group) %in% maxpairs

but tgroups has NA values for the rows outside the largest groups, which is handy for summaries such as tapply(df$value, droplevels(tgroups), mean).
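
The logical-flag variant can also be sketched end to end without the interaction factor. This is only one possible way to write it, assuming the question's data (rebuilt compactly here); `counts`, `winner` and `means` are illustrative names, not part of the answer above.

```r
# Rebuild the question's data frame compactly (same rows as the dput output)
df <- data.frame(
  year  = rep(c(1990, 1991), each = 15),
  group = factor(rep(c("A", "B", "A", "B"), times = c(6, 9, 10, 5))),
  value = c(1:15, 5:19)
)

# Rows per year (table rows) and group (table columns)
counts <- table(df$year, df$group)

# Name of the largest group in each year, indexed by year
winner <- colnames(counts)[apply(counts, 1, which.max)]
names(winner) <- rownames(counts)

# TRUE for rows that belong to their year's largest group
needstransform <- as.character(df$group) == unname(winner[as.character(df$year)])

# Example summary: mean of value within each year's largest group
means <- tapply(df$value[needstransform], df$year[needstransform], mean)
```

On this data `means` comes out as 11 for 1990 (values 7 to 15) and 9.5 for 1991 (values 5 to 14).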

Upvotes: 1

rbatt

Reputation: 4807

This is what I came up with:

df2 <- ddply(
        df, 
        .(year), 
        summarise, 
        result=sqrt(
            value[group==names(which.max(table(group)))]
        )
    )
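
One thing worth checking with this approach is whether the largest level is computed over the whole data frame or within each year. Inside ddply with summarise, a bare column name such as `group` refers to the current year's piece, whereas `df$group` always refers to all rows. A small base-R check on the question's data (rebuilt compactly here; `overall` and `per_year` are illustrative names) shows that the two can disagree:

```r
# Rebuild the question's data frame compactly (same rows as the dput output)
df <- data.frame(
  year  = rep(c(1990, 1991), each = 15),
  group = factor(rep(c("A", "B", "A", "B"), times = c(6, 9, 10, 5))),
  value = c(1:15, 5:19)
)

# Largest level over all rows: A has 16 rows, B has 14
overall <- names(which.max(table(df$group)))

# Largest level within each year
per_year <- sapply(split(df$group, df$year),
                   function(g) names(which.max(table(g))))

overall   # "A"
per_year  # 1990 -> "B", 1991 -> "A"
```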

Upvotes: 0

Thomas

Reputation: 44525

I don't think this is a very good answer because it's super obfuscated (and it doesn't use your desired plyr approach), but maybe it will stimulate someone else's thinking:

Basically, you just need to know which value of group you want to look at for each year. Suppose you work that out and store those values (in the same order as the splits of the original data by year) in a variable called m. You can then mapply a function that subsets each split (of the data by year) by group and does whatever other calculations you want.

do.call(rbind, mapply(function(x,y) { 
                          tmp <- x[x$group==y,]
                          #fun(tmp) # apply your function to the relevant subset
                      }, split(df,df$year), m, SIMPLIFY=FALSE))

I thought of three different ways you could generate m. Here they are:

m <- with(df, levels(group)[apply(table(group, year), 2, which.max)])

m <- levels(df$group)[sapply(split(df, df$year), function(x) which.max(sapply(split(x, x$group), nrow)))]

m <- with(df, levels(group)[apply(tapply(year, list(group, year), length),2,which.max)])
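
Putting the pieces together, here is a runnable sketch using the first definition of m. The function `fun` is a hypothetical stand-in for "your function" (it just adds a sqrt column), and the data frame is the question's, rebuilt compactly.

```r
# Rebuild the question's data frame compactly (same rows as the dput output)
df <- data.frame(
  year  = rep(c(1990, 1991), each = 15),
  group = factor(rep(c("A", "B", "A", "B"), times = c(6, 9, 10, 5))),
  value = c(1:15, 5:19)
)

# Largest level per year, in the same order as split(df, df$year)
m <- with(df, levels(group)[apply(table(group, year), 2, which.max)])

# Hypothetical placeholder for the real per-subset calculation
fun <- function(d) transform(d, result = sqrt(value))

# Subset each year's split to its largest level, apply fun, and recombine
res <- do.call(rbind, mapply(function(x, y) fun(x[x$group == y, ]),
                             split(df, df$year), m, SIMPLIFY = FALSE))
```

Here m is c("B", "A"), so res holds the nine B rows for 1990 followed by the ten A rows for 1991.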

Upvotes: 0
