Reputation: 1063
While using R, I am often interested in performing operations on a data.frame in which I summarize a variable by a group, and then want to add those summary values back into the data.frame. This is most easily shown by example:
myDF <- data.frame(A = runif(5), B = c("A", "A", "A", "B", "B"))
myDF$Total <- with(myDF, by(A, B, sum))[myDF$B]
myDF$Proportion <- with(myDF, A / Total)
which produces:
A B Total Proportion
1 0.5272734 A 1.7186369 0.3067975
2 0.5105128 A 1.7186369 0.2970452
3 0.6808507 A 1.7186369 0.3961574
4 0.2892025 B 0.6667133 0.4337734
5 0.3775108 B 0.6667133 0.5662266
This trick -- essentially getting a vector of named values, and "spreading" or "stretching" them across the relevant rows by group -- generally works, although class(myDF$Total) is "array" unless I put the by() inside of a c().
I am wondering:
Is there a way to do this with dplyr? Maybe there is a Hadley-approved verb operation (like mutate, arrange, etc.) about which I am unaware. I know that it is easy to summarise(), but I often need to put those summaries back into the data.frame.
Upvotes: 2
Views: 215
Reputation: 193527
Here's a "less hacky" way to do this with base R.
set.seed(1)
myDF <- data.frame(A = runif(5), B = c("A", "A", "A", "B", "B"))
within(myDF, {
  Total <- ave(A, B, FUN = sum)
  Proportion <- A/Total
})
# A B Proportion Total
# 1 0.2655087 A 0.2193406 1.210486
# 2 0.3721239 A 0.3074170 1.210486
# 3 0.5728534 A 0.4732425 1.210486
# 4 0.9082078 B 0.8182865 1.109890
# 5 0.2016819 B 0.1817135 1.109890
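As a rough sketch of why ave() sidesteps the "array" issue from the question: it should return a plain numeric vector of the same length as its input, with each group's summary repeated across that group's rows. The data frame `df` and its values are made up for illustration, chosen so the sums are easy to check by hand.

```r
# Hypothetical data, not the random values used above.
df <- data.frame(A = c(1, 2, 3, 4, 6), B = c("A", "A", "A", "B", "B"))

# ave() applies FUN within each level of B and returns a vector
# aligned with the rows of df, so it can be assigned directly.
df$Total <- ave(df$A, df$B, FUN = sum)
df$Proportion <- df$A / df$Total

class(df$Total)  # "numeric"
df$Total         # 6 6 6 10 10
```

Because the result is already row-aligned, no indexing by group name is needed.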
In "dplyr" language, I guess you're looking for mutate:
library(dplyr)
myDF %>%
  group_by(B) %>%
  mutate(Total = sum(A), Proportion = A/Total)
# Source: local data frame [5 x 4]
# Groups: B
#
# A B Total Proportion
# 1 0.2655087 A 1.210486 0.2193406
# 2 0.3721239 A 1.210486 0.3074170
# 3 0.5728534 A 1.210486 0.4732425
# 4 0.9082078 B 1.109890 0.8182865
# 5 0.2016819 B 1.109890 0.1817135
From the "Introduction to dplyr" vignette, you would find the following description:
As well as selecting from the set of existing columns, it's often useful to add new columns that are functions of existing columns. This is the job of mutate(). dplyr::mutate() works the same way as plyr::mutate() and similarly to base::transform(). The key difference between mutate() and transform() is that mutate allows you to refer to columns that you just created.
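To make that last difference concrete, here is a small sketch. It uses a made-up data frame `df` and assumes there is no object called `Total` in your workspace (otherwise transform() would pick that up instead of failing).

```r
library(dplyr)

# Hypothetical data chosen so the results are easy to verify.
df <- data.frame(A = c(1, 2, 3, 4, 6), B = c("A", "A", "A", "B", "B"))

# transform() evaluates all of its expressions against the original
# columns, so Total is not yet visible when Proportion is computed.
res <- tryCatch(
  transform(df, Total = ave(A, B, FUN = sum), Proportion = A / Total),
  error = function(e) conditionMessage(e)
)
res  # an error message, since 'Total' cannot be found

# mutate() evaluates its arguments in order, so Proportion can use Total.
out <- df %>%
  group_by(B) %>%
  mutate(Total = sum(A), Proportion = A / Total) %>%
  ungroup()
out
```

With transform() you would need two separate calls to get the same result, one for each new column.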
Also, since you've tagged this "data.table", you can "chain" commands together in "data.table" quite easily to do something like:
library(data.table)
DT <- data.table(myDF)
DT[, Total := sum(A), by = B][, Proportion := A/Total][]
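For what it's worth, here is a sketch of what each link in that chain does, using a made-up table `DT2` with easy-to-check values. The `:=` operator adds columns by reference and returns its result invisibly, which is why a trailing `[]` is used when you want the table printed.

```r
library(data.table)

# Hypothetical data, not the random values used above.
DT2 <- data.table(A = c(1, 2, 3, 4, 6), B = c("A", "A", "A", "B", "B"))

# Step 1: add Total within each group of B (DT2 is modified in place).
# Step 2: chain a second [ ] that uses the column just created.
DT2[, Total := sum(A), by = B][, Proportion := A / Total]

# := returns invisibly, so ask for the table explicitly to print it.
DT2[]
```

No reassignment is needed, since both columns are added to DT2 itself.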
Upvotes: 11