Caroline

Reputation: 101

Summarise data frame ignoring repetition

I have a data frame in which entries in one column are repeated. I want to summarise the other columns based on that one column, counting each unique entry once rather than counting every row. For example, in the data frame below, to answer the question "how many of the people surveyed are young, midage and old?", "RefID" 1-1 should count as 1 towards "ageclass" = young, not as 5.

RefID   Altitude    Sex ageclass
1-1 Low F   young
1-1 Low F   young
1-1 Low F   young
1-1 Low F   young
1-1 Low F   young
1-2 Low F   midage
1-2 Low F   midage
1-2 Low F   midage
1-2 Low F   midage
1-2 Low F   midage
1-2 Low F   midage
1-2 Low F   midage
1-2 Low F   midage
1-2 Low F   midage
1-2 Low F   midage
1-2 Low F   midage
1-2 Low F   midage
1-3 Low F   old
1-3 Low F   old
1-3 Low F   old
1-3 Low F   old
1-3 Low F   old
1-3 Low F   old
1-3 Low F   old
1-3 Low F   old
1-3 Low F   old
1-3 Low F   old
1-3 Low F   old
1-3 Low F   old
1-3 Low F   old
1-3 Low F   old
1-3 Low F   old
1-3 Low F   old
1-3 Low F   old
1-3 Low F   old
1-4 Low F   old
1-4 Low F   old
1-4 Low F   old
1-4 Low F   old
1-4 Low F   old
1-4 Low F   old
1-4 Low F   old
1-4 Low F   old
1-4 Low F   old
1-4 Low F   old
1-4 Low F   old
1-4 Low F   old
1-5 Low F   old
1-5 Low F   old
1-5 Low F   old
1-5 Low F   old
1-5 Low F   old
1-5 Low F   old
1-5 Low F   old
1-7 Low F   old
1-7 Low F   old
1-7 Low F   old
1-7 Low F   old
1-8 Low F   old
1-8 Low F   old
1-9 Low F   old
1-9 Low F   old
1-9 Low F   old

Thank You.

Upvotes: 3

Views: 307

Answers (3)

Sacha Epskamp

Reputation: 47582

With subset you can take a subset of the data, and duplicated returns a logical vector indicating whether each value has already occurred earlier in the vector. First, a small sample dataset:

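df <- data.frame(
   ID=rep(1:5,each=5),
   attitude="low",
   sex=c(rep("F",10),rep("M",15)),
   age=c(rep("young",5),rep("middle",10),rep("old",10)),
   # keep strings as factors so that summary() reports counts per level
   # (the default changed to FALSE in R 4.0.0)
   stringsAsFactors=TRUE
   )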

Then you can make a subset that keeps only the first row for each ID:

df.sub <- subset(df,!duplicated(df$ID))

Then you can summarize:

> summary(df.sub$age)
middle    old  young 
     2      2      1 
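
As a rough sketch, the same idea can be applied to column names like those in the question. The data frame survey below is a hypothetical, shortened stand-in for the asker's data (only four RefID values), and table is used instead of summary so the result does not depend on the column being a factor:

```r
# Hypothetical stand-in for the question's data: each RefID is repeated many times.
survey <- data.frame(
  RefID    = rep(c("1-1", "1-2", "1-3", "1-4"), times = c(5, 12, 18, 12)),
  ageclass = rep(c("young", "midage", "old", "old"), times = c(5, 12, 18, 12))
)

# Keep only the first row seen for each RefID ...
survey.sub <- survey[!duplicated(survey$RefID), ]

# ... then count how many distinct RefIDs fall in each age class.
counts <- table(survey.sub$ageclass)
print(counts)
```

Here each RefID contributes one row to survey.sub, so the counts are per person rather than per row.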

Upvotes: 0

Joris Meys

Reputation: 108583

To get unique entries in a data frame, see ?unique:

Data <- unique(Mydata)

You can use by :

by(Data,Data$ageclass,summary)

See also ?summary to understand the outcome. If you are interested in counts, you can use table, e.g.:

table(Data$RefID,Data$ageclass)

or for a summary :

margin.table(table(Data$RefID,Data$ageclass),margin=2)

EDIT: you'll have to be a bit careful, as unique() returns the unique rows. If you have both a male and a female with RefID 1-1, then you'll still count it twice. But I presume that won't be the case in your data. If you really want to make sure, you can do:

with(unique(Data[c(1,4)]),margin.table(table(RefID,ageclass),margin=2))
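
To illustrate the caveat, here is a small sketch with made-up data (the rows below are hypothetical, not from the question) in which RefID 1-1 appears with two different Sex values. It selects the columns by name rather than by position, which is an assumption about the column names:

```r
# Hypothetical data: RefID "1-1" appears with both "F" and "M".
Data <- data.frame(
  RefID    = c("1-1", "1-1", "1-1", "1-2"),
  Sex      = c("F", "F", "M", "F"),
  ageclass = c("young", "young", "young", "old")
)

# unique() over all columns keeps both the F row and the M row for "1-1",
# so "young" is counted twice.
all_cols <- table(unique(Data)$ageclass)

# Restricting to RefID and ageclass before unique() counts "1-1" once.
id_only <- table(unique(Data[c("RefID", "ageclass")])$ageclass)

print(all_cols)
print(id_only)
```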

or take the plyr solution mentioned here.

Upvotes: 2

Prasad Chalasani

Reputation: 20282

The plyr package is useful for this. E.g. you could do:

> require(plyr)
> ddply( df, .(ageclass), summarise, Num = length(unique(RefID)))
  ageclass Num
1   midage   1
2      old   6
3    young   1
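
If plyr is not available, a similar count can be sketched in base R with tapply. The data frame below is a hypothetical, shortened stand-in for the question's data, and the name num is just illustrative:

```r
# Hypothetical stand-in for the question's data.
df <- data.frame(
  RefID    = c("1-1", "1-1", "1-2", "1-2", "1-3", "1-3", "1-4"),
  ageclass = c("young", "young", "midage", "midage", "old", "old", "old")
)

# For each age class, count the number of distinct RefID values.
num <- tapply(df$RefID, df$ageclass, function(x) length(unique(x)))
print(num)
```

The result is a named vector indexed by age class rather than a data frame, which may or may not be convenient depending on what you do with it next.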

Upvotes: 2
