Reputation: 423
I want to get the sum of the unique values in a column per year after satisfying some conditions.
Here is my data from dput:
structure(list(key = structure(c(1L, 1L, 4L, 2L, 3L, 4L, 2L,
3L, 5L, 5L, 8L, 6L, 7L, 8L, 6L, 7L), .Label = c("1992_10_18_0",
"1992_10_18_12", "1992_10_18_18", "1992_10_18_6", "1993_10_18_0",
"1993_10_18_12", "1993_10_18_18", "1993_10_18_6"), class = "factor"),
RR = c(43.25, 43.25, 43.25, 43.25, 43.25, 43.25, 43.25, 43.25,
43.25, 43.25, 43.25, 43.25, 43.25, 43.25, 43.25, 43.25),
dist = c(1000.23361607017, 694.022935174544, 748.618896699399,
812.290633745208, 869.896619169459, 1136.88564181537,
1058.59136791648,
975.756885299645, 1000.23361607017, 694.022935174544,
748.618896699399,
812.290633745208, 869.896619169459, 1136.88564181537,
1058.59136791648,
975.756885299645), Year = c(1992L, 1992L, 1992L, 1992L, 1992L,
1992L, 1992L, 1992L, 1993L, 1993L, 1993L, 1993L, 1993L, 1993L,
1993L, 1993L)), class = "data.frame", row.names = c(NA, -16L
))
What I want:
There are four columns in the data: key, RR, dist, and Year.
I want to get the sum of RR based on the unique "key" values per year such that the "dist" is less than or equal to 1100.
What I have so far:
I am processing multiple files so the script is like this:
dat<-read.csv("test_dat.csv",header=T,stringsAsFactors=FALSE)
dat2<-dat[which(dat$dist <= 1100),]
dat3<-as.data.frame(cbind(dat2$RR,dat2$Year))
colnames(dat3)<-c("RR","Year")
agg<-aggregate(.~Year,dat3,sum,na.rm=T)
write.csv(agg,file="test.csv",row.names=T)
Any idea on how I can do this in R? I'll appreciate any help.
Upvotes: 1
Views: 6002
Reputation: 388992
One way using dplyr
could be to filter
the dist
values less than 1100 and keep only unique values for key
and sum
RR
and dist
columns.
library(dplyr)
df %>%
group_by(Year) %>%
filter(dist <= 1100 & !duplicated(key)) %>%
summarise(RR = sum(RR), dist = sum(dist))
To count distinct values, we can use n_distinct
df %>%
filter(dist <= 1100) %>%
group_by(Year) %>%
summarise(n = n_distinct(key))
Upvotes: 2
Reputation: 1473
You can accomplish this via the aggregate
function as you've used, in combination with the function unique
:
agg <- aggregate(key ~ Year, data=subset(dat, dist <= 1100), FUN=function(x) length(unique(x)))
Or, in full context:
dat<-read.csv("test_dat.csv",header=T,stringsAsFactors=FALSE)
agg <- aggregate(key ~ Year, data=subset(dat, dist <= 1100), FUN=function(x) length(unique(x)))
write.csv(agg,file="test.csv",row.names=T)
In this example, the output generated is:
Year key
1 1992 4
2 1993 4
Upvotes: 2