Reputation: 105
my problem is that I can't really get my problem down in words which makes it hard to google it, so I am forced to ask you. I hope you will shed light on my issue:
I got a data.frame like this:
6 4
5 2
3 6
0 7
0 2
1 3
6 0
1 1
As you noticed, in the first column I got 0
repeating two times, 1
two times and so one. What I would like to do is get get all the corresponging values for one number, say 0
, in the second columns (in this example 7
and 2
). Preferably in data.frame
.
I know the attempt with df$V2[which(df$V1==0)]
, however since the first column might have over 100 rows I can't really use this. Do you guys have a good solution?
Maybe some words regarding the background of this question: I need to process this data, i.e. get the mean of the second column for all 0
's in the first columns, or get min/max values.
Regards
Upvotes: 1
Views: 92
Reputation: 886948
Using data.table
library(data.table)
setDT(df)[, list(mean=mean(V2), max= max(V2), min=min(V2)), by = V1]
Upvotes: 1
Reputation: 105
Thanks all for your replies. I decided to go for the dplyr
solution posted by teucer and eipi10. Since I have a third (and maybe even a fourth) column, this solution seems to be pretty easy to use (just adding V3
to group_by
).
Since some are asking what's wrong with df$V2[which(df$V1==0)]
: I maybe was a bit unclear when saying "rows", was I actually meant was "values". Let's assume I had n
distinct values in the first column, I would have to use the command n
times for all distinct values and store the n
resulting vectors.
Upvotes: 0
Reputation: 174788
Using your data (with some temporary names attached)
txt <- "6 4
5 2
3 6
0 7
0 2
1 3
6 0
1 1"
df <- read.table(text = txt)
names(df) <- paste0("Var", seq_len(ncol(df)))
Coerce the first column to be a factor
df <- transform(df, Var1 = factor(Var1))
Then you can use aggregate()
with a nice formula interface
aggregate(Var2 ~ Var1, data = df, mean)
aggregate(Var2 ~ Var1, data = df, max)
aggregate(Var2 ~ Var1, data = df, min)
(eg:
> aggregate(Var2 ~ Var1, data = df, mean)
Var1 Var2
1 0 4.5
2 1 2.0
3 3 6.0
4 5 2.0
5 6 2.0
) or using the default interface
with(df, aggregate(Var2, list(Var1), FUN = mean))
> with(df, aggregate(Var2, list(Var1), FUN = mean))
Group.1 x
1 0 4.5
2 1 2.0
3 3 6.0
4 5 2.0
5 6 2.0
But the output is nicer from the formula interface.
Upvotes: 1
Reputation: 7248
First, what exactly is the issue with the solution you suggest? Is it a question of efficiency? Frankly the code you present is close to optimal [1].
For the general case, you're probably looking at a split-apply-combine action, to apply a function to subsets of the data based on some differentiator. As @teucer points out, dplyr
(and it's ancestor, plyr
) are designed for exactly this, as is data.tables
. In vanilla R, you would tend to use by
or aggregate
(or split
and sapply
for more advanced usage) for the same task. For example, to compute group means, you would do
by(df$V2, df$V1, mean)
or
aggregate(df, list(type=df$V1), mean)
Or even
sapply(split(df$V2, df$V1), mean)
[1] The code can be simplified to df$V2[df$V1 == 0]
or df[df$V1 == 0,]
as well.
Upvotes: 0
Reputation: 6238
Here a solution using dplyr
df %>% group_by(V1) %>% summarize(ME=mean(V2))
Upvotes: 2