Ore M

Reputation: 257

ddply for regression in R

I have a data frame that contains an 'output', average temperature, humidity, and time (given as a factor with 24 levels, not as a continuous variable) for 100 cities (identified by codes). I want to fit a regression that predicts the output for each city from the temperature, humidity, and time data, so I expect to get 100 different regression models. I used the ddply function and, with help from this thread, came up with the following line of code.

df = ddply(data, "city", function(x) coefficients(lm(output~temperature+humidity, data=x)))

This code works for the numeric predictors, temperature and humidity. But when I add the time factor (which expands to 23 dummy variables), I get an error:

df = ddply(data, "city", function(x) coefficients(lm(output~temperature+humidity+time, data=x)))

"Error: contrasts can be applied only to factors with 2 or more levels"

Does anyone know why this is? Here is an example chunk of my data frame:

city    temperature   humidity   time   output
 11        51            34        01     201
 11        43            30        02     232
 11        55            50        03     253  
 11        64            54        10     280  
 22        21            52        11     321  
 22        43            65        04     201  
 22        51            66        09     211  
 22        51            78        16     199  
 05        45            70        01     202  
 05        51            54        10     213 

So I would want three models for the three cities here, based on temperature, humidity, and the time factor.

Upvotes: 4

Views: 850

Answers (1)

Marat Talipov

Reputation: 13304

By using ddply, you apply lm to subsets of your data frame, one subset per city. It seems that some cities in the full data set have only one record. For such cases a regression is meaningless; lm will still return a result when all predictors are numeric, but it throws an error as soon as the model contains a factor variable. More generally, the error appears whenever a city's subset contains only one distinct time value, because lm drops unused factor levels before it sets up the contrasts.
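To see the failure in isolation, here is a minimal sketch in base R. The data frame `one_city` and its values are made up for illustration; it stands in for a single city's subset in which every row has the same time value.

```r
# Hypothetical subset for one city: three rows, but only one distinct time value.
one_city <- data.frame(
  output      = c(201, 232, 253),
  temperature = c(51, 43, 55),
  time        = factor(c("01", "01", "01"), levels = c("01", "02", "03"))
)

# lm drops the unused levels "02" and "03", leaving a one-level factor,
# so building the contrasts fails. tryCatch captures the error message.
res <- tryCatch(
  lm(output ~ temperature + time, data = one_city),
  error = function(e) conditionMessage(e)
)
print(res)
```

Note that this subset has three rows, so a check on the number of rows alone would not catch it.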

As a workaround, you could check the number of rows inside your anonymous function:

ddply(d,'city',function(x) if (nrow(x)==1) return() else coefficients(lm(output~temperature+humidity+time, data=x)))

where d is a slightly modified version of your sample set, in which I changed the id of the city in the last row to make sure that some cities have only one record:

d <- structure(list(
    city = c(11, 11, 11, 11, 22, 22, 22, 22, 5, 7),
    temperature = c(51L, 43L, 55L, 64L, 21L, 43L, 51L, 51L, 45L, 51L),
    humidity = c(34L, 30L, 50L, 54L, 52L, 65L, 66L, 78L, 70L, 54L),
    time = structure(c(1L, 2L, 3L, 6L, 7L, 4L, 5L, 8L, 1L, 6L),
        .Label = c("1", "2", "3", "4", "9", "10", "11", "16"), class = "factor"),
    output = c(201L, 232L, 253L, 280L, 321L, 201L, 211L, 199L, 202L, 213L)),
    .Names = c("city", "temperature", "humidity", "time", "output"),
    row.names = c(NA, -10L), class = "data.frame")

You could also use this base R code instead of ddply:

L <- split(d,d$city)

L2 <- lapply(L,function(x) {
    if (nrow(x)==1) 
        return() 
    else 
        coefficients(lm(output~temperature+humidity+time, data=x))
})

M <- do.call(rbind,L2)
df <- as.data.frame(M)

This code is wordier, but it is much easier to inspect and debug when something goes wrong.
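A possible refinement of the base R version, offered as a sketch rather than a definitive fix: guard on the number of distinct time values instead of the number of rows, and align the coefficients by name before binding them. The second step matters because lm drops unused factor levels, so different cities can produce differently named time coefficients, while rbind takes its column names from the first vector only. The helper `fit_city` and the toy data `toy` are hypothetical names introduced for this example.

```r
# Hypothetical toy data: cities A and B can be fitted; C has two rows with the
# same time value and D has a single row, so both are skipped.
toy <- data.frame(
  city        = c(rep("A", 5), rep("B", 5), "C", "C", "D"),
  temperature = c(51, 43, 55, 64, 48, 21, 43, 51, 50, 39, 45, 47, 51),
  humidity    = c(34, 30, 50, 54, 41, 52, 65, 66, 78, 60, 70, 72, 54),
  time        = factor(c("01", "02", "01", "02", "01",
                         "02", "03", "02", "03", "03",
                         "01", "01", "02")),
  output      = c(201, 232, 253, 280, 240, 321, 201, 211, 199, 230, 202, 207, 213)
)

# Fit one city, or return NULL when `time` has fewer than two distinct values.
fit_city <- function(x) {
  if (length(unique(x$time)) < 2) return(NULL)
  coefficients(lm(output ~ temperature + humidity + time, data = x))
}

fits <- lapply(split(toy, toy$city), fit_city)
fits <- Filter(Negate(is.null), fits)   # drop the skipped cities

# Align by coefficient name; coefficients a city does not have become NA.
all_names <- unique(unlist(lapply(fits, names)))
M <- do.call(rbind, lapply(fits, function(cf) setNames(cf[all_names], all_names)))
df <- as.data.frame(M)
print(df)
```

The result has one row per fitted city and one column per coefficient name seen in any city, so a value is never filed under another city's column label.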

Upvotes: 5
