Ernest Presley
Ernest Presley

Reputation: 171

subset based on date in a reference table

I have table1 as follows.

   StudentId        Date1         Lunch
   23433            2014-08-26    Yes
   233989           2014-08-18    No
   909978           2014-08-06    No
   777492           2014-08-11    Yes
   3987387          2014-08-26    No

I have another table, table2 which is as follows

Id  StudentId        Date2        Result_Nm
1   777492           2012.06.10   0.1
2   777492           2013.12.06   2.0
3   777492           2014.08.30   0.6
4   23433            2011.08.26   3.0
5   23433            2015.04.06   3.0
6   233989           2011.05.14   0.003
7   233989           2014.09.14   0.05
8   909978           2004-09-12   0.2
9   909978           2005-05-10   0.23
10  909978           2015-01-02   2.4
11  3987387          2014-10-06   3.5
12  3987387          2014-08-26   1.17

I want to retain only observations from table2 dataset where the Date2 values are less than the Date1 values for each StudentId. In other words it should contain these rows.

  Id  StudentId        Date2         Result_Nm
  1   777492           2012.06.10    0.1
  2   777492           2013.12.06    2.0
  4   23433            2011.08.26    3.0
  6   233989           2014.09.14    0.05
  8   909978           2004-09-12    0.2
  9   909978           2005-05-10    0.23
  12  3987387          2014-08-26    1.17

Observation 3 is excluded because the Date1 value for StudentId 777492 is 2014-08-11 and this value is less than 2014.08.30, similarly observations 5,7,10,11 so on. I have used subset before but this is little more challenging , need help.

Upvotes: 2

Views: 90

Answers (2)

Xinting WANG
Xinting WANG

Reputation: 1985

Here is my solution.

d1 <- read.table("d1.txt", header=T)
d2 <- read.table("d2.txt", header=T)
d3 <- merge(d1,d2, by="StudentId")
d3$Date1 <- strptime(d3$Date1, format="%Y-%m-%d")
library(stringr)
d3$Date2 <-str_c(str_sub(d3$Date2, 1L,4L),"-",str_sub(d3$Date2,6L,7L), "-", str_sub(d3$Date2,9L,10L))
d3$Date2 <- strptime(d3$Date2, format="%Y-%m-%d")
d3$mark <- d3$Date2-d3$Date1
d3 <- d3[d3$mark<1,]
d3 <- d3[,c(1,4,5,6)]

Upvotes: 0

akrun
akrun

Reputation: 887301

We can change the 'Date' columns to 'Date' class by using ymd from lubridate. It can take multiple formats (., -). Join the two dataset (left_join) by 'StudentId', remove the rows using filter and select the specific columns

library(lubridate) 
library(dplyr)
df2$Date2 <- ymd(df2$Date2)
df1$Date1 <- ymd(df1$Date1)

left_join(df2, df1, by='StudentId') %>% 
                     filter(Date2 <=Date1) %>% 
                     select(1:4)
#    Id StudentId      Date2 Result_Nm
#1  1    777492 2012-06-10     0.100
#2  2    777492 2013-12-06     2.000
#3  4     23433 2011-08-26     3.000
#4  6    233989 2011-05-14     0.003
#5  8    909978 2004-09-12     0.200
#6  9    909978 2005-05-10     0.230
#7 12   3987387 2014-08-26     1.170

Or we can use data.table. Here we convert the 'df2' from 'data.frame' to 'data.table' (setDT), set the key as 'StudentId' (setkey(..., StudentId)), join with a subset of 'df1' ('StudentId', 'Date1'), filter the output dataset based on the condition (.SD[Date2 <= Date1]) grouped by the 'key' variable. More info about .EACHI is here

library(data.table)
setkey(setDT(df2),StudentId)[df1[1:2], .SD[Date2<=Date1],by=.EACHI][order(Id)]
#   StudentId Id      Date2 Result_Nm
#1:    777492  1 2012-06-10     0.100
#2:    777492  2 2013-12-06     2.000
#3:     23433  4 2011-08-26     3.000
#4:    233989  6 2011-05-14     0.003
#5:    909978  8 2004-09-12     0.200
#6:    909978  9 2005-05-10     0.230
#7:   3987387 12 2014-08-26     1.170

NOTE: The 'Dates' were already changed to 'Date' class before the join.

data

df1 <-  structure(list(StudentId = c(23433L, 233989L, 909978L,
777492L, 
3987387L), Date1 = c("2014-08-26", "2014-08-18", "2014-08-06", 
"2014-08-11", "2014-08-26"), Lunch = c("Yes", "No", "No", "Yes", 
"No")), .Names = c("StudentId", "Date1", "Lunch"), 
class = "data.frame", row.names = c(NA, -5L))

df2 <-  structure(list(Id = 1:12, StudentId = c(777492L, 777492L, 
777492L, 
23433L, 23433L, 233989L, 233989L, 909978L, 909978L, 909978L, 
3987387L, 3987387L), Date2 = c("2012.06.10", "2013.12.06", 
"2014.08.30", 
"2011.08.26", "2015.04.06", "2011.05.14", "2014.09.14", "2004-09-12", 
"2005-05-10", "2015-01-02", "2014-10-06", "2014-08-26"), 
Result_Nm = c(0.1, 
2, 0.6, 3, 3, 0.003, 0.05, 0.2, 0.23, 2.4, 3.5, 1.17)),
.Names = c("Id", 
"StudentId", "Date2", "Result_Nm"), class = "data.frame", 
row.names = c(NA, -12L))

Upvotes: 2

Related Questions