Reputation: 1017
I have a data setup function which currently takes 2 arguments: testData and ID1. I want to add the outcome variable as a third argument.
Suppose outcomevar=c(y1,y2,y3); the function should then create the lagged and differenced versions of each outcome variable.
preparedata<-function(testData,ID1,outcomevar){
#Order temp data by firm and date
testData <- testData[order(testData$firm,testData$date),]
#Create lagged outcomevar for each firm
testData <- ddply(testData, .(firm), transform,
ly1 = c( NA, y1[-length(y1)] ) )
#Create differenced variable
testData$dy1<-(testData$y1-testData$ly1)
}
where the "l" and "d" in front of y1 stand for lagged and differenced. How can I include the outcome variable as an argument? Thanks
Upvotes: 0
Views: 158
Reputation: 103898
You could process all outcome variables simultaneously by first gathering them into a key-value column pair:
set.seed(1)
df <- data.frame(
firm = rep(LETTERS[1:5], each = 10),
date = as.Date("2014-01-01") + 1:10,
y1 = sample(100, 50),
y2 = sample(100, 50),
y3 = sample(100, 50)
)
library(dplyr)
library(tidyr)
df %>%
gather(key, value, y1:y3) %>%
group_by(firm, key) %>%
mutate(lag = lag(value), diff = value - lag)
#> Source: local data frame [150 x 6]
#> Groups: firm, key
#>
#> firm date key value lag diff
#> 1 A 2014-01-02 y1 27 NA NA
#> 2 A 2014-01-03 y1 37 27 10
#> 3 A 2014-01-04 y1 57 37 20
#> 4 A 2014-01-05 y1 89 57 32
#> 5 A 2014-01-06 y1 20 89 -69
#> 6 A 2014-01-07 y1 86 20 66
#> 7 A 2014-01-08 y1 97 86 11
#> 8 A 2014-01-09 y1 62 97 -35
#> 9 A 2014-01-10 y1 58 62 -4
#> 10 A 2014-01-11 y1 6 58 -52
#> .. ... ... ... ... ... ...
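For anyone who wants to see the key-value idea without extra packages, here is a minimal base R sketch. It is not the code from this answer: it swaps in stack() and ave() for gather() and the grouped mutate(), uses a small made-up data frame, and computes the difference as value minus lag to match the y1 - ly1 definition in the question.

```r
# Toy data (made up for illustration): two firms, three dates, two outcomes.
df <- data.frame(
  firm = rep(c("A", "B"), each = 3),
  date = rep(as.Date("2014-01-01") + 1:3, times = 2),
  y1 = c(1, 4, 9, 2, 3, 5),
  y2 = c(10, 20, 40, 7, 7, 8)
)
outcomes <- c("y1", "y2")

# stack() puts all outcome values into one column (`values`) and records the
# source column name in a factor (`ind`), i.e. a key-value pair.
long <- cbind(
  df[rep(seq_len(nrow(df)), times = length(outcomes)), c("firm", "date")],
  stack(df[outcomes])
)
names(long)[names(long) == "ind"] <- "key"
names(long)[names(long) == "values"] <- "value"
rownames(long) <- NULL

# Lag within each firm/key combination (rows are already in date order),
# then take the difference as value minus lag.
long$lag <- ave(long$value, long$firm, long$key,
                FUN = function(x) c(NA, head(x, -1)))
long$diff <- long$value - long$lag
```

The long layout means one lag and one subtraction cover every outcome column, at the cost of having to reshape back if you need separate ly1, dy1, ... columns.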
Upvotes: 0
Reputation: 11514
Here is an outline of a function that relies more heavily on your example:
preparedata <- function(testData, outcomevar) {
  library(plyr)
  # Order by firm and date so the lag follows each firm's time series
  testData <- testData[order(testData$firm, testData$date), ]
  # Copy the outcome column (given by name) into a temporary column
  testData$tmp.var <- testData[[outcomevar]]
  # Lag the temporary column within each firm
  testData <- ddply(testData, .(firm), transform,
                    lvar = c(NA, tmp.var[-length(tmp.var)]))
  testData$tmp.var <- NULL
  # Differenced variable: outcome minus its lag, named d<outcomevar>
  testData[[paste0("d", outcomevar)]] <- testData[[outcomevar]] - testData$lvar
  # Rename the lag column to l<outcomevar>
  colnames(testData)[colnames(testData) == "lvar"] <- paste0("l", outcomevar)
  return(testData)
}
Using the df defined in jihoward's answer, we get
> head(preparedata(df,"y1"))
firm date y1 y2 y3 ly1 dy1
1 A 2014-01-02 27 48 66 NA NA
2 A 2014-01-03 37 86 35 27 10
3 A 2014-01-04 57 43 27 37 20
4 A 2014-01-05 89 24 97 57 32
5 A 2014-01-06 20 7 61 89 -69
6 A 2014-01-07 86 10 21 20 66
This function returns a data frame where ly1 is the lagged variable and dy1 is the differenced variable for the outcome specified with the second argument, outcomevar. Note that you pass the name of the variable (i.e. a character string) to the function. That is, write "y1", not y1, when you call the function.
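Since the question mentions several outcomes (y1, y2, y3), the same name-as-string idea can be extended to a character vector. The following is only a sketch in base R, not the function above: add_lag_diff and its group/time arguments are hypothetical names, and it assumes the data has one row per firm and date.

```r
# Hypothetical helper (not from the answers in this thread): adds l<var> and
# d<var> columns for every column name listed in `outcomevars`.
add_lag_diff <- function(data, outcomevars, group = "firm", time = "date") {
  stopifnot(is.character(outcomevars), all(outcomevars %in% names(data)))
  # Sort so that "previous row within a group" means "previous date"
  data <- data[order(data[[group]], data[[time]]), ]
  for (v in outcomevars) {
    # ave() applies the shift separately within each group
    lagged <- ave(data[[v]], data[[group]],
                  FUN = function(x) c(NA, head(x, -1)))
    data[[paste0("l", v)]] <- lagged
    data[[paste0("d", v)]] <- data[[v]] - lagged
  }
  data
}

# Made-up, deliberately unsorted toy data
toy <- data.frame(
  firm = c("B", "A", "A", "B", "A"),
  date = as.Date("2014-01-01") + c(1, 2, 1, 2, 3),
  y1 = c(5, 20, 10, 8, 40),
  y2 = c(1, 3, 2, 1, 7)
)
res <- add_lag_diff(toy, c("y1", "y2"))
```

Looking columns up with data[[v]] is what makes passing names as strings work; it also avoids evaluating arbitrary text, which is why it is usually preferred over eval(parse(...)).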
Upvotes: 0
Reputation: 59355
Here's a solution using data tables:
# create sample dataset
set.seed(1)
df <- data.frame(firm=rep(LETTERS[1:5],each=10),
date=as.Date("2014-01-01")+1:10,
y1=sample(1:100,50),y2=sample(1:100,50),y3=sample(1:100,50))
preparedata<-function(testData,ID1,outcomevar){
require(data.table)
DT <- as.data.table(testData)
setkey(DT,firm,date)
DT[,lag := c(NA,unlist(.SD)[-.N]), by=firm, .SDcols=outcomevar]
DT[,diff := c(NA,diff(unlist(.SD))), by=firm, .SDcols=outcomevar]
setnames(DT,c("lag","diff"),paste0(c("l","d"),outcomevar))
return(DT)
}
result <- preparedata(df,1,outcomevar="y1")
head(result)
# firm date y1 y2 y3 ly1 dy1
# 1: A 2014-01-02 27 48 66 NA NA
# 2: A 2014-01-03 37 86 35 27 10
# 3: A 2014-01-04 57 43 27 37 20
# 4: A 2014-01-05 89 24 97 57 32
# 5: A 2014-01-06 20 7 61 89 -69
# 6: A 2014-01-07 86 10 21 20 66
This assumes you pass the name of the column containing the "outcomevar", not the column itself.
You should read the documentation on data tables (?data.table), but in brief this code converts the input data frame to a data table, orders the data table (using setkey(...)), and adds two new columns by reference: lag and diff. .SD is a special variable in the data table framework which is an alias for "the subset of the original DT containing the rows specified in by=...". You can use .SDcols=... to specify which columns to include. The diff(...) function calculates lagged differences, which is the same thing you were doing. Finally, we rename the columns lag and diff to, e.g., ly1 and dy1.
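To make the claim about diff(...) concrete, here is a small base R check, using the first six y1 values for firm A from the output above. It is just an illustration for a single group: c(NA, diff(x)) gives the same result as subtracting the lagged vector from x.

```r
# First six y1 values for firm A, taken from the printed output above
x <- c(27, 37, 57, 89, 20, 86)

# Lag by one position, padding the first element with NA
lagged <- c(NA, head(x, -1))

# Difference computed two ways
by_subtraction <- x - lagged
by_diff <- c(NA, diff(x))

stopifnot(isTRUE(all.equal(by_subtraction, by_diff)))
by_diff
# [1]  NA  10  20  32 -69  66
```

The NA padding is needed because diff() returns a vector one element shorter than its input.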
Upvotes: 1