Reputation: 87
Just learning R.
Given a data.frame in R with two columns, one numeric and one categorical, how do I extract a portion of the data.frame for use?
str(ex0331)
'data.frame': 36 obs. of 2 variables:
$ Iron : num 0.71 1.66 2.01 2.16 2.42 ...
$ Supplement: Factor w/ 2 levels "Fe3","Fe4": 1 1 1 1 1 1 1 1 1 1 ...
Basically, I need to be able to operate on the two factor levels separately; i.e., I need to be able to determine the length/mean/sd/etc. of the Iron retention rate for each Supplement type (Fe3 or Fe4).
What's the easiest way to accomplish this?
I'm aware of the by() command. For example, the following gets some of what I need:
by(ex0331, ex0331$Supplement, summary)
ex0331$Supplement: Fe3
Iron Supplement
Min. :0.710 Fe3:18
1st Qu.:2.420 Fe4: 0
Median :3.475
Mean :3.699
3rd Qu.:4.472
Max. :8.240
------------------------------------------------------------
ex0331$Supplement: Fe4
Iron Supplement
Min. : 2.200 Fe3: 0
1st Qu.: 3.892 Fe4:18
Median : 5.750
Mean : 5.937
3rd Qu.: 6.970
Max. :12.450
But I need more flexibility. I need to apply axis commands, for example, or log() functions by group. I'm sure there's an easy way to do this; I just don't see it. All of the data.frame manipulation documentation I've seen is for numerical rather than categorical variables.
Upvotes: 5
Views: 28999
Reputation: 47602
You can get a subset of your data by indexing or by using subset:
ex0331 <- data.frame( iron=rnorm(36), supplement=c("Fe3","Fe4"))
subset(ex0331, supplement=="Fe3")
subset(ex0331, supplement=="Fe4")
ex0331[ex0331$supplement=="Fe3",]
Or do it all at once with split, which returns a list with one data.frame per level:
split(ex0331,ex0331$supplement)
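As a rough sketch of how the list from split can be used for the length/mean/sd the question asks about, something like the following should work. The data here is random stand-in data (not the real ex0331), and the names groups and stats are just illustrative:

```r
# Stand-in data shaped like the example above; values are random, not the real ex0331
set.seed(1)
ex0331 <- data.frame(iron = rnorm(36),
                     supplement = rep(c("Fe3", "Fe4"), times = 18))

# split() returns a named list with one data.frame per supplement level
groups <- split(ex0331, ex0331$supplement)

# Apply several summaries to each piece; sapply() binds the results
# into a matrix with one column per group
stats <- sapply(groups, function(d) {
  c(n = nrow(d), mean = mean(d$iron), sd = sd(d$iron))
})
stats
```

Each element of groups is an ordinary data.frame, so groups$Fe3$iron can be passed to any function that takes a numeric vector.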
Another thing you can do is use tapply to split by a factor and then apply a function:
tapply(ex0331$iron,ex0331$supplement,mean)
Fe3 Fe4
-0.15443861 -0.01308835
The plyr package, which has loads of useful functions, can also be used. For example:
library(plyr)
daply(ex0331, .(supplement), function(x) mean(x$iron))
Fe3 Fe4
-0.15443861 -0.01308835
In response to the edited question, you could get the log of iron per supplement with:
ex0331 <- data.frame( iron=abs(rnorm(36)), supplement=c("Fe3","Fe4"))
tapply(ex0331$iron,ex0331$supplement,log)
Or with plyr:
library(plyr)
dlply(ex0331,.(supplement),function(x)log(x$iron))
Both return a list. I'm sure there is an easier way than the wrapper function in the plyr example, though.
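One possibly simpler route, offered as a sketch rather than the definitive answer: log() works element by element, so the grouping is not needed to compute it. You could compute the log once as a new column and group afterwards. The column name log_iron and the stand-in data are assumptions for illustration:

```r
# Stand-in data; abs() keeps the values positive so log() is defined
set.seed(1)
ex0331 <- data.frame(iron = abs(rnorm(36)),
                     supplement = rep(c("Fe3", "Fe4"), times = 18))

# log() is vectorised, so compute it once for the whole column
ex0331$log_iron <- log(ex0331$iron)

# Group afterwards: a named list with one numeric vector per supplement
log_by_group <- split(ex0331$log_iron, ex0331$supplement)

# Per-group summaries of the logged values
sapply(log_by_group, mean)
```

This gives the same per-group vectors as the tapply call, while also keeping the logged values alongside the original rows.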
Upvotes: 3
Reputation: 20282
I'd recommend using the ddply function from the plyr package; detailed documentation is available online:
> require(plyr)
> ddply( ex0331, .(Supplement), summarise,
mean = mean(Iron),
sd = sd(Iron),
len = length(Iron))
Supplement mean sd len
1 Fe3 -0.3749169 0.2827360 4
2 Fe4 0.1953116 0.7128129 6
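If you would rather not depend on plyr, a similar per-group table can probably be built with base R's aggregate. This is only a sketch on random stand-in data; the name res and the flattening step with do.call are my own choices, not part of the ddply approach:

```r
# Stand-in data using the question's column names; values are random
set.seed(1)
ex0331 <- data.frame(Iron = abs(rnorm(36)),
                     Supplement = factor(rep(c("Fe3", "Fe4"), each = 18)))

# aggregate() applies FUN to Iron within each Supplement level;
# returning a named vector yields a matrix-valued Iron column
res <- aggregate(Iron ~ Supplement, data = ex0331,
                 FUN = function(x) c(mean = mean(x), sd = sd(x), len = length(x)))

# Flatten the matrix column into ordinary columns
# (Iron.mean, Iron.sd, Iron.len)
res <- do.call(data.frame, res)
res
```

The result has one row per Supplement, much like the ddply output, just with prefixed column names.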
Update.
To add a LogIron column where each entry is the log() of the Iron value, you would simply use transform:
> transform(ex0331, LogIron = log(Iron))
Iron Supplement LogIron
1 0.07185141 Fe3 -2.63315498
2 1.10367297 Fe3 0.09864368
3 0.48592428 Fe3 -0.72170246
4 0.20286918 Fe3 -1.59519393
5 0.80830682 Fe4 -0.21281357
Or, to create a summary that is the "mean of the log Iron values, per Supplement", you would do:
> ddply( ex0331, .(Supplement), summarise, meanLog = mean(log(Iron)))
Supplement meanLog
1 Fe3 -1.0062304
2 Fe4 0.2791507
Upvotes: 3