stuwest
stuwest

Reputation: 928

Distinct rows based on date range

I have a large-ish dataset with a structure something like the following:

structure(list(pathogen = c("MRSA", "L. pneumophila", "MRSA", 
"L. pneumophila", "MRSA", "MRSA", "MRSA", "L. pneumophila", "L. pneumophila", 
"MRSA"), variant = c("mecA", "sg1", "mecA", "sg1", "mecA", "mecC", 
"mecA", "sg1", "sg6", "mecA"), n = c(25L, 14L, 235L, 2L, 64L, 
15L, 13L, 6L, 11L, 8L), date = structure(c(15156, 15248, 15279, 
15279, 15309, 15340, 15340, 15400, 15431, 15461), class = "Date")), .Names = c("pathogen", 
"variant", "n", "date"), row.names = c(NA, -10L), class = "data.frame")

I want to find every row with a combination of variables that has not been recorded in the previous x-month period. So when I look for a combination of pathogen and variant that has not been recorded in the previous 3 months I go from:

         pathogen variant   n       date
1            MRSA    mecA  25 2011-07-01
2  L. pneumophila     sg1  14 2011-10-01
3            MRSA    mecA 235 2011-11-01
4  L. pneumophila     sg1   2 2011-11-01
5            MRSA    mecA  64 2011-12-01
6            MRSA    mecC  15 2012-01-01
7            MRSA    mecA  13 2012-01-01
8  L. pneumophila     sg1   6 2012-03-01
9  L. pneumophila     sg6  11 2012-04-01
10           MRSA    mecA   8 2012-05-01

to:

         pathogen variant   n       date
1            MRSA    mecA  25 2011-07-01
2  L. pneumophila     sg1  14 2011-10-01
3            MRSA    mecA 235 2011-11-01
6            MRSA    mecC  15 2012-01-01
8  L. pneumophila     sg1   6 2012-03-01
9  L. pneumophila     sg6  11 2012-04-01
10           MRSA    mecA   8 2012-05-01

All the solutions I've thought of so far involve writing loops. I'm also trying to use dplyr for analysis as much as possible, so my question is: is this possible in dplyr? And if not, what would an R-ish approach look like?

Upvotes: 3

Views: 215

Answers (1)

talat
talat

Reputation: 70256

I'm not sure about the best way to handle exact month intervalls, but to get you started, you could compute the difference in number of days as follows (resulting in the same output as shown in the question):

df %>%
  group_by(pathogen, variant) %>%
  filter(c(TRUE, diff(date) > 90))   # check for difference of 90 days
#Source: local data frame [7 x 4]
#Groups: pathogen, variant
#
#        pathogen variant   n       date
#1           MRSA    mecA  25 2011-07-01
#2 L. pneumophila     sg1  14 2011-10-01
#3           MRSA    mecA 235 2011-11-01
#4           MRSA    mecC  15 2012-01-01
#5 L. pneumophila     sg1   6 2012-03-01
#6 L. pneumophila     sg6  11 2012-04-01
#7           MRSA    mecA   8 2012-05-01

Upvotes: 3

Related Questions