Reputation: 835
I have a dataset as -
id date sector region price score
1 SWECEC2 2011-12-31 Finance R1 2.297936 -1.0000000
2 FRACPY1 2011-12-31 FMCG R1 10.500936 -0.8571429
3 FRAACO1 2011-12-31 Finance R1 19.228936 -0.7142857
4 GERAIM1 2011-12-31 Finance R1 4.936936 -0.5714286
5 GERAKX1 2011-12-31 Finance R1 8.140936 -0.5714286
6 NETAHI1 2011-12-31 FMCG R1 -7.229064 -0.5714286
7 UKIJLK1 2011-12-31 Finance R2 17.766936 -0.5714286
8 FRAAHJ1 2011-12-31 Finance R2 5.840936 -0.4285714
9 FRACLB1 2011-12-31 FMCG R2 -4.490064 -0.4285714
10 GERAIO1 2011-12-31 FMCG R2 2.338936 -0.4285714
I need to create a new column to this dataset that will categorize each row of the dataset for a group of date, sector and region as 1 if the score is less than the average score for that group of date, sector and region else 2. i.e. average of score for date=2011-12-32, sector=Finance, Region = R1 is -0.7142857 and average of score for date=2011-12-32, sector=FMCG, Region = R2 is -0.4285714 and so on then the new column for the first group should be like -
id date sector region price score Rating
SWECEC2 12/31/2011 Finance R1 2.297936 -1 1
FRAACO1 12/31/2011 Finance R1 19.228936 -0.7142857 2
GERAIM1 12/31/2011 Finance R1 4.936936 -0.5714286 2
GERAKX1 12/31/2011 Finance R1 8.140936 -0.5714286 2
I thought we could do this in ddply but I can't think of how to because I am not summarizing the data here, just creating a new column based on each group in a data.
Any suggestions?
Thanks!
The sample of my data set is here -
ds <- structure(list(id = structure(c(9L, 4L, 1L, 5L, 7L, 8L, 10L,
2L, 3L, 6L), .Label = c("FRAACO1", "FRAAHJ1", "FRACLB1", "FRACPY1",
"GERAIM1", "GERAIO1", "GERAKX1", "NETAHI1", "SWECEC2", "UKIJLK1"
), class = "factor"), date = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "12/31/2011", class = "factor"),
sector = structure(c(1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L,
2L), .Label = c("Finance", "FMCG"), class = "factor"), region = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("R1", "R2"
), class = "factor"), price = c(2.297936, 10.500936, 19.228936,
4.936936, 8.140936, -7.229064, 17.766936, 5.840936, -4.490064,
2.338936), score = c(-1, -0.8571429, -0.7142857, -0.5714286,
-0.5714286, -0.5714286, -0.5714286, -0.4285714, -0.4285714,
-0.4285714)), .Names = c("id", "date", "sector", "region",
"price", "score"), class = "data.frame", row.names = c(NA, -10L
))
Upvotes: 1
Views: 257
Reputation: 611
Another way without loading libraries:
# Split data
ds.splitted <- split(ds, list(ds$date, ds$sector, ds$region))
# Function to create Ratings (input: df, output: vector of 1's and 2's)
myfun <- function(x){
x.mean.score <- mean(x$score)
ifelse(x$score < x.mean.score, 1, 2)
}
# get the ratings
lapply(ds.splitted, function(y) cbind(y, Rating = myfun(y)))
Or optionally it is possible to combine the output list into a data frame directly:
ds.ratings <- do.call("rbind",
lapply(ds.splitted, function(y) cbind(y, Rating = myfun(y))))
# to conserve the order in the original ds data frame as rowname:
rownames(ds.ratings) <- unlist(lapply(ds.splitted, rownames))
Upvotes: 1
Reputation: 27388
Here's an approach similar to @koshske's, using dplyr
:
library(dplyr)
ds.new <- merge(ds %.%
group_by(date, sector, region) %.%
summarise(mean = mean(score)), ds)
within(ds.new, Rating <- ifelse(score < mean, 1, 2))
# date sector region mean id price score Rating
# 1 12/31/2011 Finance R1 -0.7142857 SWECEC2 2.297936 -1.0000000 1
# 2 12/31/2011 Finance R1 -0.7142857 FRAACO1 19.228936 -0.7142857 2
# 3 12/31/2011 Finance R1 -0.7142857 GERAIM1 4.936936 -0.5714286 2
# 4 12/31/2011 Finance R1 -0.7142857 GERAKX1 8.140936 -0.5714286 2
# 5 12/31/2011 Finance R2 -0.5000000 UKIJLK1 17.766936 -0.5714286 1
# 6 12/31/2011 Finance R2 -0.5000000 FRAAHJ1 5.840936 -0.4285714 2
# 7 12/31/2011 FMCG R1 -0.7142857 FRACPY1 10.500936 -0.8571429 1
# 8 12/31/2011 FMCG R1 -0.7142857 NETAHI1 -7.229064 -0.5714286 2
# 9 12/31/2011 FMCG R2 -0.4285714 FRACLB1 -4.490064 -0.4285714 2
# 10 12/31/2011 FMCG R2 -0.4285714 GERAIO1 2.338936 -0.4285714 2
There's a good intro to dplyr
here. One of the neat things about it is that you can "chain" expressions with %.%
.
As @iugrina mentions in the comment below, the summarise
and merge
is in fact unnecessary; we can use mutate to achieve the same:
ds %.%
group_by(date,sector,region) %.%
mutate(Rating=ifelse(score < mean(score), 1, 2))
Upvotes: 2
Reputation: 66842
Try this:
library(plyr)
ddply(ds, .(date, sector, region), function(x) data.frame(x, rate = ifelse(x$score < mean(x$score), 1, 2)))
then you will get:
id date sector region price score rate
1 SWECEC2 12/31/2011 Finance R1 2.297936 -1.0000000 1
2 FRAACO1 12/31/2011 Finance R1 19.228936 -0.7142857 2
3 GERAIM1 12/31/2011 Finance R1 4.936936 -0.5714286 2
4 GERAKX1 12/31/2011 Finance R1 8.140936 -0.5714286 2
5 UKIJLK1 12/31/2011 Finance R2 17.766936 -0.5714286 1
6 FRAAHJ1 12/31/2011 Finance R2 5.840936 -0.4285714 2
7 FRACPY1 12/31/2011 FMCG R1 10.500936 -0.8571429 1
8 NETAHI1 12/31/2011 FMCG R1 -7.229064 -0.5714286 2
9 FRACLB1 12/31/2011 FMCG R2 -4.490064 -0.4285714 2
10 GERAIO1 12/31/2011 FMCG R2 2.338936 -0.4285714 2
Upvotes: 2