Reputation: 1159
I currently have two data.table tables. The first looks like:
> data1
Person Date
1 2007-1-1
2 2007-1-3
3 2007-1-9
4 2007-1-17
5 2007-1-30
The second looks like:
> data2
Person Date
1 2007-1-2
1 2007-1-3
1 2007-1-5
2 2007-1-4
2 2007-1-6
2 2007-1-7
2 2007-1-8
3 2007-1-19
4 2007-1-19
4 2007-1-25
5 2007-2-28
5 2007-3-5
I would like to have:
Person Date Between
1 2007-1-1 1
2 2007-1-3 4
3 2007-1-9 0
4 2007-1-17 2
5 2007-1-30 2
Here, I would like to look at the first person in the first data.table and look at the gap in the date between the 1st person and 2nd, so that the gap is: 2007-1-1 to 2007-1-3. Then, I would like to do a search on the second by person, and return how many dates in the second data.table corresponding to person 1 is between 2007-1-1 to 2007-1-3. Here, there was only 1 instance, so we put 1 in between.
For the second case, there are four instances in the second data set with dates between 2007-1-3 and 2007-1-9, so the Between column takes on the value 4. For the last instance, Person 5, we have that 2007-2-31 and 2007-3-5 are two dates past the value for Person 5 in the first table: 2007-1-30.
I have been able to write a for-loop for this,
vector.data <- rep(NA, 5)
for(i in 1:5){
index <- which(data1$date == data2$date)
data1[index,]
}
However, my main concern is the speed. I would like to do this for a data set that is to the order of 100 millions rows. Hence, I was wondering if there was a data.table solution or some other fast solution for this. Thank you!
Upvotes: 2
Views: 156
Reputation: 17432
This seemed to do the trick, and should be relatively fast:
> dt2$MaxDate = dt1[dt2$Person + 1, "Date"]
> dt2$MinDate = dt1[dt2$Person, "Date"]
> dt2[dt2$Person == max(dt2$Person),]$MaxDate = Sys.Date() #Last person can be any time
> dt2$IsBetween = with(dt2, Date > MinDate & Date < MaxDate)
So here's what the table looks like now:
> dt2
Person Date MaxDate MinDate IsBetween
1 1 2007-01-02 2007-01-03 2007-01-01 TRUE
2 1 2007-01-03 2007-01-03 2007-01-01 FALSE
3 1 2007-01-05 2007-01-03 2007-01-01 FALSE
4 2 2007-01-04 2007-01-09 2007-01-03 TRUE
5 2 2007-01-06 2007-01-09 2007-01-03 TRUE
6 2 2007-01-07 2007-01-09 2007-01-03 TRUE
7 2 2007-01-08 2007-01-09 2007-01-03 TRUE
8 3 2007-01-19 2007-01-17 2007-01-09 FALSE
9 4 2007-01-19 2007-01-30 2007-01-17 TRUE
10 4 2007-01-25 2007-01-30 2007-01-17 TRUE
11 5 2007-02-28 2014-09-17 2007-01-30 TRUE
12 5 2007-03-05 2014-09-17 2007-01-30 TRUE
Use tapply
to group results:
> dt1$Between = tapply(dt2$IsBetween, dt2$Person, sum)
> dt1
Person Date Between
1 1 2007-01-01 1
2 2 2007-01-03 4
3 3 2007-01-09 0
4 4 2007-01-17 2
5 5 2007-01-30 2
I used base data.frame rather than data.table because the identical column names would make scoping confusing. In this case I think performance should be fine
Upvotes: 1