Reputation: 101
I have a data frame in which entries in one column are repeated. I want to summarize the other columns based on that one column, so that the summary counts each unique entry once rather than counting every row. For example, in the data frame below, if I want to answer the question "How many of the people surveyed are young, midage and old?", "RefID" 1-1 should count as 1 towards "ageclass" = young, not as 5.
RefID Altitude Sex ageclass
1-1 Low F young
1-1 Low F young
1-1 Low F young
1-1 Low F young
1-1 Low F young
1-2 Low F midage
1-2 Low F midage
1-2 Low F midage
1-2 Low F midage
1-2 Low F midage
1-2 Low F midage
1-2 Low F midage
1-2 Low F midage
1-2 Low F midage
1-2 Low F midage
1-2 Low F midage
1-2 Low F midage
1-3 Low F old
1-3 Low F old
1-3 Low F old
1-3 Low F old
1-3 Low F old
1-3 Low F old
1-3 Low F old
1-3 Low F old
1-3 Low F old
1-3 Low F old
1-3 Low F old
1-3 Low F old
1-3 Low F old
1-3 Low F old
1-3 Low F old
1-3 Low F old
1-3 Low F old
1-3 Low F old
1-4 Low F old
1-4 Low F old
1-4 Low F old
1-4 Low F old
1-4 Low F old
1-4 Low F old
1-4 Low F old
1-4 Low F old
1-4 Low F old
1-4 Low F old
1-4 Low F old
1-4 Low F old
1-5 Low F old
1-5 Low F old
1-5 Low F old
1-5 Low F old
1-5 Low F old
1-5 Low F old
1-5 Low F old
1-7 Low F old
1-7 Low F old
1-7 Low F old
1-7 Low F old
1-8 Low F old
1-8 Low F old
1-9 Low F old
1-9 Low F old
1-9 Low F old
Thank You.
Upvotes: 3
Views: 307
Reputation: 47582
With subset() you make a subset of the data, and with duplicated() you get a logical vector indicating whether a value has already occurred earlier in a vector. First, a small sample dataset:
df <- data.frame(
ID=rep(1:5,each=5),
attitude="low",
sex=c(rep("F",10),rep("M",15)),
age=c(rep("young",5),rep("middle",10),rep("old",10))
)
Then you can make a subset that keeps only the first row in which each ID appears:
df.sub <- subset(df,!duplicated(df$ID))
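Then you can summarize (the column is wrapped in factor() because, from R 4.0.0 on, data.frame() no longer converts strings to factors by default):
> summary(factor(df.sub$age))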
middle old young
2 2 1
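Applied to the data in the question, the same idea might look like the sketch below. The data frame `survey` is a shortened, hypothetical stand-in for the original data (only RefID and ageclass, with fewer repeated rows), so its name and contents are assumptions rather than the asker's actual objects.

```r
# Hypothetical, shortened stand-in for the question's data:
# each RefID appears several times with the same ageclass.
survey <- data.frame(
  RefID    = c("1-1", "1-1", "1-2", "1-2", "1-2", "1-3", "1-3", "1-4"),
  ageclass = c("young", "young", "midage", "midage", "midage",
               "old", "old", "old"),
  stringsAsFactors = FALSE
)

# Keep only the first row seen for each RefID.
first_rows <- survey[!duplicated(survey$RefID), ]

# Count individuals (not rows) per age class.
age_counts <- table(first_rows$ageclass)
print(age_counts)
```

Indexing with the logical vector directly does the same job as subset() here; table() is used so the counts do not depend on whether ageclass is stored as a factor.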
Upvotes: 0
Reputation: 108583
To get unique entries in a dataframe, see ?unique :
Data <- unique(Mydata)
You can then use by to get a summary per age class:
by(Data,Data$ageclass,summary)
See also ?summary to understand the outcome. If you are interested in counts, you can use table, e.g.:
table(Data$RefID,Data$ageclass)
or for a summary :
margin.table(table(Data$RefID,Data$ageclass),margin=2)
EDIT: You'll have to be a bit careful, as unique() takes the unique rows. If you have both a male and a female with RefID 1-1, you'll still count it twice. But I presume that won't be the case in your data. If you really want to make sure, you can do:
with(unique(Data[c(1,4)]),margin.table(table(RefID,ageclass),margin=2))
or take the plyr solution mentioned here.
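To illustrate the caveat about unique() working on whole rows, here is a minimal sketch. The data are made up for the purpose: RefID "1-1" deliberately appears with two different Sex values, which is an assumption and not something present in the question's data.

```r
# Hypothetical data: RefID "1-1" appears with two different Sex values,
# e.g. because of a data-entry inconsistency.
Data <- data.frame(
  RefID    = c("1-1", "1-1", "1-1", "1-2", "1-2"),
  Sex      = c("F", "F", "M", "F", "F"),
  ageclass = c("young", "young", "young", "old", "old"),
  stringsAsFactors = FALSE
)

# unique() on all columns keeps both the F and the M row for "1-1",
# so "young" ends up counted twice.
all_cols   <- unique(Data)
counts_all <- table(all_cols$ageclass)
print(counts_all)

# Restricting to the two relevant columns first avoids the double count.
id_age    <- unique(Data[c("RefID", "ageclass")])
counts_id <- table(id_age$ageclass)
print(counts_id)
```

Selecting the columns by name rather than by position (c(1,4)) is a small design choice that keeps the code working if the column order changes.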
Upvotes: 2
Reputation: 20282
The plyr package is useful for this. E.g. you could do:
> require(plyr)
> ddply( df, .(ageclass), summarise, Num = length(unique(RefID)))
ageclass Num
1 midage 1
2 old 6
3 young 1
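If installing plyr is not an option, a roughly equivalent count can be sketched in base R with tapply(). The small `df` below is a hypothetical, shortened version of the question's data, so the counts differ from the output above.

```r
# Hypothetical, shortened stand-in for the question's data.
df <- data.frame(
  RefID    = c("1-1", "1-1", "1-2", "1-2", "1-3", "1-3", "1-4"),
  ageclass = c("young", "young", "midage", "midage", "old", "old", "old"),
  stringsAsFactors = FALSE
)

# For each age class, count the distinct RefID values.
num_by_age <- tapply(df$RefID, df$ageclass,
                     function(ids) length(unique(ids)))
print(num_by_age)
```

The result is a named vector rather than a data frame, which is usually enough for a quick count.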
Upvotes: 2