TinaW

Reputation: 1017

outcome variable as argument in regression function

I have a data setup function that currently takes two arguments, testData and ID1. I want to add the outcome variable as a third argument.

Suppose outcomevar=c(y1,y2,y3); the function should then create the lagged and differenced versions of each outcome variable.

preparedata<-function(testData,ID1,outcomevar){
#Order temp data by firm and date
            testData <- testData[order(testData$firm,testData$date),]
#Create lagged outcomevar for each firm
            testData <- ddply(testData, .(firm), transform,
            ly1 = c( NA, y1[-length(y1)] ) )
#Create differenced variable
            testData$dy1<-(testData$y1-testData$ly1)
}

where the "l" and "d" in front of y1 stand for lagged and differenced. How can I pass the outcome variable as an argument? Thanks, T

Upvotes: 0

Views: 158

Answers (3)

hadley

Reputation: 103898

You could process all outcome variables simultaneously by first gathering them into a key-value column pair:

set.seed(1)
df <- data.frame(
  firm = rep(LETTERS[1:5], each = 10),
  date = as.Date("2014-01-01") + 1:10,
  y1 = sample(100, 50),
  y2 = sample(100, 50),
  y3 = sample(100, 50)
)

library(dplyr)
library(tidyr)
df %>%
  gather(key, value, y1:y3) %>%
  group_by(firm, key) %>%
  mutate(lag = lag(value), diff = value - lag)
#> Source: local data frame [150 x 6]
#> Groups: firm, key
#> 
#>    firm       date key value lag diff
#> 1     A 2014-01-02  y1    27  NA   NA
#> 2     A 2014-01-03  y1    37  27   10
#> 3     A 2014-01-04  y1    57  37   20
#> 4     A 2014-01-05  y1    89  57   32
#> 5     A 2014-01-06  y1    20  89  -69
#> 6     A 2014-01-07  y1    86  20   66
#> 7     A 2014-01-08  y1    97  86   11
#> 8     A 2014-01-09  y1    62  97  -35
#> 9     A 2014-01-10  y1    58  62   -4
#> 10    A 2014-01-11  y1     6  58  -52
#> ..  ...        ... ...   ... ...  ...
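
If you would rather keep the data wide and get columns named ly1, dy1, and so on, here is a minimal sketch. It assumes dplyr 1.0.0 or later (for across()), and the helper name add_lag_diff_wide and the toy data are made up for illustration. The difference is computed as value minus lag, as in the question.

```r
library(dplyr)

# Hypothetical helper: outcomevars is a character vector of column names.
add_lag_diff_wide <- function(data, outcomevars) {
  data %>%
    arrange(firm, date) %>%          # lag must follow date order within firm
    group_by(firm) %>%
    mutate(across(
      all_of(outcomevars),
      list(l = ~ dplyr::lag(.x),         # lagged value
           d = ~ .x - dplyr::lag(.x)),   # value minus lagged value
      .names = "{.fn}{.col}"             # gives ly1, dy1, ly2, dy2, ...
    )) %>%
    ungroup()
}

# Small made-up dataset so the result is easy to check by hand
toy <- data.frame(
  firm = rep(c("A", "B"), each = 3),
  date = as.Date("2014-01-01") + rep(1:3, times = 2),
  y1 = c(1, 4, 9, 2, 3, 5),
  y2 = c(10, 20, 40, 5, 5, 6)
)
wide <- add_lag_diff_wide(toy, c("y1", "y2"))
```

Here the first row of each firm gets NA for both new columns, since there is no earlier observation to lag from.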

Upvotes: 0

coffeinjunky

Reputation: 11514

Here is an outline of a function that relies more heavily on your example:

 preparedata <- function(testData, outcomevar) {
   require(plyr)
   # Order by firm and date so the lag follows each firm's time series
   testData <- testData[order(testData$firm, testData$date), ]
   # Copy the outcome column (given by name) into a temporary column
   testData$tmp.var <- testData[[outcomevar]]
   # Create the lagged variable within each firm
   testData <- ddply(testData, .(firm), transform,
                     lvar = c(NA, tmp.var[-length(tmp.var)]))
   testData$tmp.var <- NULL
   # Differenced variable: current value minus lagged value
   testData[[paste0("d", outcomevar)]] <- testData[[outcomevar]] - testData$lvar
   # Rename the lag column to "l" + outcome name, e.g. ly1
   colnames(testData)[colnames(testData) == "lvar"] <- paste0("l", outcomevar)
   return(testData)
 }

Using the df defined in jlhoward's answer, we get

 > head(preparedata(df,"y1"))

   firm       date y1 y2 y3  ly1 dy1
 1    A 2014-01-02 27 48 66   NA  NA
 2    A 2014-01-03 37 86 35   27  10
 3    A 2014-01-04 57 43 27   37  20
 4    A 2014-01-05 89 24 97   57  32
 5    A 2014-01-06 20  7 61   89 -69
 6    A 2014-01-07 86 10 21   20  66

This function returns a data frame in which ly1 is the lagged and dy1 the differenced version of the variable named by the second argument, outcomevar. Note that you pass the column name as a character string: write "y1", not y1, when you call the function.
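
Since the question mentions several outcome variables at once, the same idea can be looped over a character vector of names. The following is only a sketch in base R, not part of the function above; the helper name add_lag_diff and the toy data are hypothetical, and it assumes the columns firm and date exist as in the question.

```r
# Hypothetical helper: outcomevars is a character vector such as c("y1", "y2").
add_lag_diff <- function(testData, outcomevars) {
  # Order by firm and date so the lag follows each firm's time series
  testData <- testData[order(testData$firm, testData$date), ]
  for (v in outcomevars) {
    # ave() applies FUN within each firm and returns a vector aligned with the rows
    lagged <- ave(testData[[v]], testData$firm,
                  FUN = function(x) c(NA, x[-length(x)]))
    testData[[paste0("l", v)]] <- lagged                  # e.g. ly1
    testData[[paste0("d", v)]] <- testData[[v]] - lagged  # e.g. dy1
  }
  testData
}

# Small made-up dataset so the result is easy to check by hand
toy <- data.frame(
  firm = rep(c("A", "B"), each = 3),
  date = as.Date("2014-01-01") + rep(1:3, times = 2),
  y1 = c(1, 4, 9, 2, 3, 5),
  y2 = c(10, 20, 40, 5, 5, 6)
)
out <- add_lag_diff(toy, c("y1", "y2"))
```

As before, the names are passed as strings, so the call is add_lag_diff(toy, c("y1", "y2")) rather than add_lag_diff(toy, c(y1, y2)).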

Upvotes: 0

jlhoward

Reputation: 59355

Here's a solution using data tables:

# create sample dataset
set.seed(1)
df <- data.frame(firm=rep(LETTERS[1:5],each=10),
                 date=as.Date("2014-01-01")+1:10,
                 y1=sample(1:100,50),y2=sample(1:100,50),y3=sample(1:100,50))


preparedata<-function(testData,ID1,outcomevar){
  require(data.table)
  DT <- as.data.table(testData)
  setkey(DT,firm,date)
  DT[,lag  := c(NA,unlist(.SD)[-.N]),  by=firm, .SDcols=outcomevar]
  DT[,diff := c(NA,diff(unlist(.SD))), by=firm, .SDcols=outcomevar]
  setnames(DT,c("lag","diff"),paste0(c("l","d"),outcomevar))
  return(DT)
}

result <- preparedata(df,1,outcomevar="y1")
head(result)
#    firm       date y1 y2 y3 ly1 dy1
# 1:    A 2014-01-02 27 48 66  NA  NA
# 2:    A 2014-01-03 37 86 35  27  10
# 3:    A 2014-01-04 57 43 27  37  20
# 4:    A 2014-01-05 89 24 97  57  32
# 5:    A 2014-01-06 20  7 61  89 -69
# 6:    A 2014-01-07 86 10 21  20  66

This assumes you pass the name of the column containing the "outcomevar", not the column itself.

You should read the documentation on data tables (?data.table), but in brief this code converts the input data frame to a data table, orders the data table (using setkey(...)), and adds two new columns by reference: lag and diff. .SD is a special variable in the data table framework which is an alias for "the subset of the original DT containing the rows specified in by=...". You can specify which columns to include using .SDcols=.... The diff(...) function calculates lagged differences, which is the same thing you were doing. Finally, we rename the columns lag and diff to, e.g. ly1 and dy1.
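
To see that diff(...) really matches the manual "value minus lagged value" calculation, here is a small base R check on the first six y1 values for firm A from the output above. The variable names are only for illustration.

```r
# First six y1 values for firm A, taken from the output above
x <- c(27, 37, 57, 89, 20, 86)

lagged   <- c(NA, x[-length(x)])  # manual lag: shift down by one, pad with NA
manual   <- x - lagged            # value minus lagged value
via_diff <- c(NA, diff(x))        # diff() drops the first element, so pad with NA

stopifnot(isTRUE(all.equal(manual, via_diff)))
print(via_diff)
# [1]  NA  10  20  32 -69  66
```

The leading NA is needed because diff() returns a vector one element shorter than its input.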

Upvotes: 1
