Rebecca
Rebecca

Reputation: 79

Loop to calculate linear models and write results to matrix

I have a dataset that is organized as below, with data for 23 dates (shown below are data for 1 date and part of a second date - note: the headings are offset). I would like to run linear models of y~x (lm(y~x)) by FrontBack and date (i.e., 2 lm's for each date, 1 for Front and 1 for Back). I would then like to summarize the output in a matrix where the slope, intercept, and errors for each of these are listed in separate columns. This should give me a matrix with 46 rows (23 dates x 2 levels of front/back).

Date FrontBack     y            x    
20140916    Back    2234.580    2.253175    
20140916    Back    2267.631    7.725422    
20140916    Back    2246.668    14.414951    
20140916    Back    2216.307    17.837861   
20140916    Back    2214.225    15.484364   
20140916    Front   2245.522    90.062102   
20140916    Back    2219.565    12.326474   
20140916    Front   2267.427    63.396137   
20140916    Back    2213.286    7.861758   
20140916    Front   2264.902    61.661650   
20140916    Front   2256.183    70.124702   
20140916    Back    2202.254    7.400539   
20140916    Front   2241.997    44.826769   
20140916    Back    2204.868    5.739663   
20140916    Back    2209.424    2.165606   
20140916    Front   2266.947    1.068334   
20140917    Back    2237.199    2.190785   
20140917    Back    2248.541    4.415886   
20140917    Back    2260.041    8.724817   
20140917    Back    2277.407    13.420694   
20140917    Back    2278.414    14.789667   
20140917    Front   2346.622    29.878672   
20140917    Back    2268.111    15.120095   
20140917    Front   2496.946    60.30390

Upvotes: 1

Views: 833

Answers (2)

bergant
bergant

Reputation: 7232

For example:

# break data frame to list of data frames
df_list <- split(df1, f = paste(df1$Date, df1$FrontBack) )

# run lm models on each data frame
models <- lapply(df_list, function(dat) {
  lm(y ~ x, data = dat)
})

# format result
res <- 
  do.call(
    rbind,
    lapply(models, function(x) {
      coefs <- coef(summary(x))
      c( intercept = coefs["(Intercept)", "Estimate"],
         slope = coefs["x", "Estimate"],
         p_value = coefs["x","Pr(>|t|)"], 
         InterceptP = coefs["(Intercept)","Pr(>|t|)"], 
         StdErrorX = coefs["x","Std. Error"],
         StdErrorI = coefs["(Intercept)","Std. Error"],
         r2 = x$r.squared,
         AIC = AIC(x)
        )
    })
  )

Upvotes: 1

chepyle
chepyle

Reputation: 996

Because you only have two categories, would subsetting the front and back portions be ok? If so, the following code should give you what you were looking for:

fake.y=runif(15*5,min=2200,max=2500) #make some fake data for testing
fake.FrontBack<-sample(c("Front","Back"),15*5,replace=T)
fake.Date<-sort(sample(seq(20140916,20140920),15*5,replace=T))
fake.data<-data.frame(Date=fake.Date,FrontBack=fake.FrontBack,y=fake.y,x=fake.y/120+runif(15*5)+1*(fake.FrontBack=="Back"))

# starting here use your data instead of fake.data
f<-subset(fake.data,FrontBack=="Front") # use subset to select front/back
b<-subset(fake.data,FrontBack=="Back")
f.fit<-lm(y~x,data=f)# regress front values
f.resid<-data.frame(Date=f$Date,resid=f.fit$residuals,int=f.fit$coefficients[1],slope=f.fit$coefficients[2],FrontBack="Front") # make dataframe of residuals
b.fit<-lm(y~x,data=b)# regress back values
b.resid<-data.frame(Date=b$Date,resid=b.fit$residuals,int=b.fit$coefficients[1],slope=b.fit$coefficients[2],FrontBack="Back") # make dataframe of residuals
all.resid<-rbind(f.resid,b.resid) # stick 'em together
head(all.resid) # should be what you wanted, residual error for all entires

# could be what you wanted, aggregate mean error for each date, front and back
ag<-aggregate(resid~.,data=all.resid,mean)
print(ag)

the code will spit out something like this, to show the model error residual for each row in your original data:

      Date     resid      int    slope FrontBack
1 20140916 -47.91676 393.5386 97.52173     Front
3 20140916  52.01027 393.5386 97.52173     Front
4 20140916  52.58631 393.5386 97.52173     Front
5 20140916 -17.56038 393.5386 97.52173     Front
6 20140916 -21.85633 393.5386 97.52173     Front
9 20140916  44.97382 393.5386 97.52173     Front

If you wanted the average error for each day, the aggregated data will be:

       Date      int     slope FrontBack      resid
1  20140916 393.5386  97.52173     Front  10.372821
2  20140917 393.5386  97.52173     Front  -3.840699
3  20140918 393.5386  97.52173     Front -10.876092
4  20140919 393.5386  97.52173     Front  -2.159973
5  20140920 393.5386  97.52173     Front   8.878526
6  20140916 212.6598 101.60367      Back   2.862525
7  20140917 212.6598 101.60367      Back -14.476662
8  20140918 212.6598 101.60367      Back  10.822712
9  20140919 212.6598 101.60367      Back -10.072473
10 20140920 212.6598 101.60367      Back  10.472516

With a separate front and back entries for each date, the subsetting and regression should work in the same for your data as in this example

Upvotes: 1

Related Questions