Reputation: 1
I have a dataset that has a date object (one date per individual).
I'd like to get the sum of all possible pairs that meet pre-specified criteria. For example, I want the total number of pairs that could be "matched" based on the fact that their dates were within, say, 15 days of each other:
df <- data.frame("ID"=c(1:5), "Date"=c("2005-01-05","2005-01-08","2005-01-21","2005-01-22","2005-02-04"))
df
ID1 matches with ID2, ID2 matches with ID3, ID2 matches with ID4 ID3 matches with ID4, ID3 matches with ID5 ID4 matches with ID5
All I want for output is the sum of all possible pairs (in this case, n=6 possible pairs).
I've played with ddply and aggregate functions quite a bit for this problem, but really can't nail down where I'm going wrong. I suspect it has something to do with the fact that I have date objects. I'll spare you all of my elementary attempted, unsuccessful solutions.
And no, this is not homework. Somewhat new to R and this is part of a larger cluster analysis project I am working on.
Upvotes: 0
Views: 70
Reputation: 2636
df <- data.frame("ID"=c(1:5), "Date"=c("2005-01-05","2005-01-08","2005-01-
21","2005-01-22","2005-02-04"))
df$Date <- as.Date(df$Date) #coerce to Date
#create a square matrix; each col contains same dates
m1 <- matrix(df$Date,nrow=nrow(df),ncol=nrow(df))
m2 <- t(m1) #transpose
m1
m2
#subtract the two matrices; and subset lower triangle of result
diff <- (m1-m2)[lower.tri(m1)]
sum(abs(diff) <= 15) #count those that meet criteria
Upvotes: 1
Reputation: 19544
df <- data.frame("ID"=c(1:5), "Date"=c("2005-01-05","2005-01-08","2005-01-21","2005-01-22","2005-02-04"))
df$Date <- as.POSIXct(df$Date)
df$MaxDate <- df$Date+15*24*3600
library(sqldf)
result = sqldf("select a.ID, b.ID from df a inner join df b
on a.Date between b.Date and b.MaxDate and a.ID<>b.ID")
Upvotes: 0
Reputation: 614
df <- data.frame("ID"=c(1:5), "Date"=c("2005-01-05","2005-01-08","2005-01-21","2005-01-22","2005-02-04"))
# Convert to a date field
df$Date <- as.Date(df$Date)
library(dplyr)
# Create an index that will be used to prevent recomparing pairs that have already been counted
df$myid <- seq(1:length(df$Date))
group_by(df,Date) %>% summarise(my.x=length(Date[abs(difftime(Date,df$Date[myid:nrow(df)],units="days"))<=15])-1)
Upvotes: 0