tjnel
tjnel

Reputation: 683

Apply Function for Rows of a Dataframe where the Function uses Row Elements

I'm trying to apply a function to each element of a data frame. A simple example of such a data frame would be:

> accts
ACCOUNT       DATE
   1     2008-03-01
   2     2009-06-17
   3     2008-07-02
   4     2009-03-15

What I need to do is look at each row of this data frame and go find that account in a larger data frame, like the one shown below:

> trans
ACCOUNT_NUM  TRAN_DATE
        1    2008-02-02
        2    2008-04-02
        3    2008-03-16
        3    2009-08-22
        3    2008-05-05
        6    2010-11-03
        7    2008-09-18
        4    2009-10-14
        4    2009-01-15
       10    2011-07-06

For each row in the 'accts' data frame I need to get the record in the 'trans' data frame corresponding to that account which also has the 'TRAN_DATE' that occurred nearest the 'DATE' but prior to it. I tried to use the apply function:

tranDateVector <- apply(accts, 2, getTranDate)

getTranDate <- function(x)
{
  tranDate <- subset(trans$TRAN_DATE, with(trans, ACCOUNT_NUM == x[1] & TRAN_DATE < x[2]))
  dataDiff <- x[2] - tranDate
  tranDate <- unique(date[which(dateDiff == min(dateDiff))])
  return(tranDate)
}

accts <- cbind(accts, tranDateVector)

When I run my mini-example I get the following error:

Error in charToDate(x) : 
  character string is not in a standard unambiguous format

However, when I run my full-blown version I get a different error, which I have realized is coming from this line:

subset(trans$TRAN_DATE, with(trans, ACCOUNT_NUM == x[1] & TRAN_DATE < x[2]))

If I set x to be the third row of my 'accts' data frame, so:

     x
     ACCOUNT       DATE
3       3      2008-07-02

and run the 'subset' line of code I get the following error, which corresponds to the error I get on my regular code:

> subset(trans$TRAN_DATE, with(trans, ACCOUNT_NUM == x[1] & TRAN_DATE < x[2]))
Error in eval(expr, envir, enclos) : 
  dims [product 1] do not match the length of object [10]
In addition: Warning message:
In eval(expr, envir, enclos) :
  Incompatible methods ("Ops.Date", "Ops.data.frame") for "<"

Thanks for your help.


(Information below was added after the answer to the above was provided b/c I realized there was a complication)

There are additional constraints on the function that I just realized need to be considered, and these cause the problem to become a bit more complicated. In the 'accts' data frame there are two different statuses:

> accts <- data.frame(
+     ACCOUNT = 1:4,
+     DATE = as.Date(c("2008-03-01", "2009-06-17",
+                      "2008-07-02", "2009-03-15")),
+     STATUS = c("new", "old", "new", "old"))

In the 'accts' frame a record can be classified as either old or new. If the account is 'new' than it needs to meet the conditions specified earlier, but it also must only be matched with records in 'trans' flagged as 'revised'. Likewise for 'old' accounts, they can only be compared to the 'orig' records of trans:

> trans <- data.frame(
+     ACCOUNT_NUM = c(1,2,3,3,3,6,7,4,4,10),
+     TRAN_DATE = as.Date(c("2008-02-02", "2008-04-02",
+                           "2008-03-16", "2009-08-22",
+                           "2008-05-05", "2010-11-03",
+                           "2008-09-18", "2009-10-14",
+                           "2009-01-15", "2011-07-06")),
+     BALANCE = c("orig", "orig", "orig", "orig", "revised", "orig", "revised", "revised", "revised", "orig"))

I tried to implement your code to fit this situation as follows:

library(plyr)
adply(accts, 1, transform,
            TRAN_DATE = { 
                 if(STATUS == "old")
                 {
                    data <- subset(trans, ACCOUNT_NUM == ACCOUNT &
                                             TRAN_DATE < DATE & BALANCE == "orig")
                 }else{
                    data <- subset(trans, ACCOUNT_NUM == ACCOUNT &
                                             TRAN_DATE < DATE & BALANCE == "revised")
                 }
                 tail(data$TRAN_DATE, 1) })

I get the following error from this code:

Error in data.frame(list(ACCOUNT = 1L, DATE = 13939, STATUS = 1L), BALANCE = list( : 
  arguments imply differing number of rows: 1, 0

My apologies for not specifying this requirement in my initial post, I didn't realize it would cause a problem.

Upvotes: 1

Views: 3743

Answers (1)

flodel
flodel

Reputation: 89057

Because you data mixes types (numbers, dates), I'd stay away from using apply as it will coerce your data into a single type. Instead I'd recommend using plyr's adply function which does preserve all types as each row is processed as a data.frame. It also has the advantage that fields can still be accessed using the column names and that usually leads to much more readable code as I will let you judge.

Your data:

accts <- data.frame(
  ACCOUNT = 1:4,
  DATE = as.Date(c("2008-03-01", "2009-06-17",
                   "2008-07-02", "2009-03-15")))

trans <- data.frame(
  ACCOUNT_NUM = c(1,2,3,3,3,6,7,4,4,10),
  TRAN_DATE = as.Date(c("2008-02-02", "2008-04-02",
                        "2008-03-16", "2009-08-22",
                        "2008-05-05", "2010-11-03",
                        "2008-09-18", "2009-10-14",
                        "2009-01-15", "2011-07-06")))

A solution using adply:

library(plyr)
adply(accts, 1, transform,
      TRAN_DATE = { data <- subset(trans, ACCOUNT_NUM == ACCOUNT &
                                          TRAN_DATE < DATE)
                    tail(data$TRAN_DATE, 1) })
#   ACCOUNT       DATE  TRAN_DATE
# 1       1 2008-03-01 2008-02-02
# 2       2 2009-06-17 2008-04-02
# 3       3 2008-07-02 2008-05-05
# 4       4 2009-03-15 2009-01-15

Upvotes: 4

Related Questions