count
count

Reputation: 1338

Extract rows with most current date out of a data frame

I would like to extract certain rows from a data frame containing one colum as Date (column C). Here is a small example:

The output should look like this:

Before <- data.frame(A=c("0010","0011","0012","0015","0024","0032","0032","0033","0039","0039","0039","0041","0054"),
                     B=c(11,12,11,11,12,12,12,11,"NA","NA",11,11,11),
                     C=c("2014-01-07","2013-06-03","2013-07-29","2014-07-14","2012-12-17","2013-08-21","2013-08-21","2014-07-11","2012-10-06","2012-10-06","2013-10-22","2014-05-28","2014-03-26"))

After <- data.frame(A=c("0010","0011","0012","0015","0024","0032","0033","0039","0041","0054"),
                    B=c(11,12,11,11,12,12,11,11,11,11),
                    C=c("2014-01-07","2013-06-03","2013-07-29","2014-07-14","2012-12-17","2013-08-21","2014-07-11","2013-10-22","2014-05-28","2014-03-26"))

So would I'm aiming for is:

  1. Only give out entries with the latest date (out of row 9,10,11 (BEFORE)) --> give out only row 8 (AFTER)
  2. Give out identical entries only once (row 6 and 7 (BEFORE)) --> give out only row 6 (AFTER)

I wasn't able to find a solution using subset, unique etc. Any help appreciated!

Upvotes: 0

Views: 198

Answers (4)

Arun
Arun

Reputation: 118779

Here are two data.table variations depending on the assumptions on data:

  • Assuming that your data already has the latest date for each group of A as the last element:

    require(data.table)
    setDT(Before)[, .SD[.N], by=A]
    

.SD holds the Subset of Data for each group in A and .N holds the number of observations in that group. So, .SD[.N] gives us the last observation, for each group.

  • Without any assumptions:

    require(data.table)
    setDT(Before)[, C := as.Date(C)][, .SD[which.max(C)], by=A]
    

Here, first we replace C with as.Date(C) using data.table's := operator which modifies by reference (without making any copy, hence fast+memory efficient). Then, for each A data subset, we subset the row correspondng to the maximum value of C.

HTH

Upvotes: 1

jfreels
jfreels

Reputation: 123

require(dplyr)
Before %>%
  mutate(C=as.Date(C)) %>%
  group_by(A) %>%
  arrange(A,desc(C)) %>%
  filter(row_number()==1)

#Source: local data frame [10 x 3]
#Groups: A

#      A  B          C
#1  0010 11 2014-01-07
#2  0011 12 2013-06-03
#3  0012 11 2013-07-29
#4  0015 11 2014-07-14
#5  0024 12 2012-12-17
#6  0032 12 2013-08-21
#7  0033 11 2014-07-11
#8  0039 11 2013-10-22
#9  0041 11 2014-05-28
#10 0054 11 2014-03-26

Upvotes: 1

Anders Ellern Bilgrau
Anders Ellern Bilgrau

Reputation: 10223

By using the fact that dates act like numerics something like the following might do the trick:

Before$C <- as.Date(Before$C)  # Convert to dates
ans <- aggregate(C ~ A + B, max, data = Before)  # Aggregate date, choose the last date
ans <- ans[ans$B != "NA", ]  # Remove NA in col B
print(ans)
#      A  B          C
#1  0010 11 2014-01-07
#2  0012 11 2013-07-29
#3  0015 11 2014-07-14
#4  0033 11 2014-07-11
#5  0039 11 2013-10-22
#6  0041 11 2014-05-28
#7  0054 11 2014-03-26
#8  0011 12 2013-06-03
#9  0024 12 2012-12-17
#10 0032 12 2013-08-21

The max of type Date will return the most recent one.

Upvotes: 0

Roland
Roland

Reputation: 132626

split-apply-combine:

Before$C <- as.Date(Before$C)
library(plyr)
ddply(Before, .(A), function(df) {
  df <- df[df$C==max(df$C),]
  df[!duplicated(df),]
  })

#      A  B          C
#1  0010 11 2014-01-07
#2  0011 12 2013-06-03
#3  0012 11 2013-07-29
#4  0015 11 2014-07-14
#5  0024 12 2012-12-17
#6  0032 12 2013-08-21
#7  0033 11 2014-07-11
#8  0039 11 2013-10-22
#9  0041 11 2014-05-28
#10 0054 11 2014-03-26

Upvotes: 0

Related Questions