Lyndz

Reputation: 423

Sum of unique values in a column

I want to get the sum of the unique values in a column per year after satisfying some conditions.

Here is my data from dput:

structure(list(
  key = structure(c(1L, 1L, 4L, 2L, 3L, 4L, 2L, 3L, 5L, 5L, 8L, 6L, 7L, 8L, 6L, 7L),
    .Label = c("1992_10_18_0", "1992_10_18_12", "1992_10_18_18", "1992_10_18_6",
               "1993_10_18_0", "1993_10_18_12", "1993_10_18_18", "1993_10_18_6"),
    class = "factor"),
  RR = c(43.25, 43.25, 43.25, 43.25, 43.25, 43.25, 43.25, 43.25,
         43.25, 43.25, 43.25, 43.25, 43.25, 43.25, 43.25, 43.25),
  dist = c(1000.23361607017, 694.022935174544, 748.618896699399, 812.290633745208,
           869.896619169459, 1136.88564181537, 1058.59136791648, 975.756885299645,
           1000.23361607017, 694.022935174544, 748.618896699399, 812.290633745208,
           869.896619169459, 1136.88564181537, 1058.59136791648, 975.756885299645),
  Year = c(1992L, 1992L, 1992L, 1992L, 1992L, 1992L, 1992L, 1992L,
           1993L, 1993L, 1993L, 1993L, 1993L, 1993L, 1993L, 1993L)),
  class = "data.frame", row.names = c(NA, -16L))

What I want:

There are four columns in the data: key, RR, dist, and Year.

I want to get the sum of RR over the unique "key" values per year, considering only rows where "dist" is less than or equal to 1100.
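A quick hand check of the expected result (using the sample data above, where RR is constant at 43.25): dropping the one row per year with dist > 1100 leaves 4 unique keys in each year, so each yearly sum should be:

```r
# 4 unique keys per year, each contributing RR = 43.25
4 * 43.25
# [1] 173
```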

What I have so far:

I am processing multiple files so the script is like this:

dat <- read.csv("test_dat.csv", header = TRUE, stringsAsFactors = FALSE)

dat2 <- dat[which(dat$dist <= 1100), ]
dat3 <- as.data.frame(cbind(dat2$RR, dat2$Year))
colnames(dat3) <- c("RR", "Year")
agg <- aggregate(. ~ Year, dat3, sum, na.rm = TRUE)

write.csv(agg, file = "test.csv", row.names = TRUE)

Any idea on how I can do this in R? I'd appreciate any help.

Upvotes: 1

Views: 6002

Answers (2)

Ronak Shah

Reputation: 388992

One way using dplyr is to filter rows where dist is less than or equal to 1100, keep only the first occurrence of each key within each year, and then sum the RR and dist columns:

library(dplyr)

df %>%
  group_by(Year) %>%
  filter(dist <= 1100 & !duplicated(key)) %>%
  summarise(RR = sum(RR), dist = sum(dist))

To count distinct values, we can use n_distinct:

df %>%
  filter(dist <= 1100) %>%
  group_by(Year) %>%
  summarise(n = n_distinct(key))

Upvotes: 2

DMR

Reputation: 1473

You can accomplish this with the aggregate function you're already using, combined with unique:

agg <- aggregate(key ~ Year, data=subset(dat, dist <= 1100), FUN=function(x) length(unique(x)))

Or, in full context:

dat<-read.csv("test_dat.csv",header=T,stringsAsFactors=FALSE)
agg <- aggregate(key ~ Year, data=subset(dat, dist <= 1100), FUN=function(x) length(unique(x)))
write.csv(agg,file="test.csv",row.names=T)

In this example, the output generated is:

  Year key
1 1992   4
2 1993   4
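If the goal is the RR sum over unique keys rather than a count, the same aggregate pattern works after dropping duplicated Year/key pairs first. A standalone base R sketch (the sample data from the question is rebuilt inline, with rounded dist values):

```r
# Rebuild the question's sample data (dist values rounded for brevity)
hours <- c("0", "0", "6", "12", "18", "6", "12", "18")
dat <- data.frame(
  key  = paste(rep(c(1992L, 1993L), each = 8), "10_18", hours, sep = "_"),
  RR   = 43.25,
  dist = rep(c(1000.23, 694.02, 748.62, 812.29, 869.90,
               1136.89, 1058.59, 975.76), 2),
  Year = rep(c(1992L, 1993L), each = 8),
  stringsAsFactors = FALSE
)

# Keep rows within the distance threshold, then drop duplicate
# Year/key pairs so each key counts once per year
keep <- subset(dat, dist <= 1100)
uniq <- keep[!duplicated(keep[c("Year", "key")]), ]

# Sum RR per year over the deduplicated keys
agg <- aggregate(RR ~ Year, data = uniq, FUN = sum)
agg
#   Year  RR
# 1 1992 173
# 2 1993 173
```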

Upvotes: 2
