Reputation: 257
I have a data frame that contains an 'output', average temperature, humidity, and time (given as 24 factors, not continuous) for 100 cities (identified by codes). I want to fit a regression that predicts the output for each city from the temperature, humidity, and time data, so I expect to end up with 100 different regression models. I used the ddply function and came up with the following line of code with help from this thread.
df = ddply(data, "city", function(x) coefficients(lm(output~temperature+humidity, data=x)))
This code works for the numeric data, temperature and humidity. But when I add in the time zone factor data (which is 23 factor variables) I get an error:
df = ddply(data, "city", function(x) coefficients(lm(output~temperature+humidity+time, data=x)))
"Error: contrasts can be applied only to factors with 2 or more levels"
Does anyone know why this is? Here is an example chunk of my data frame:
city temperature humidity time output
11 51 34 01 201
11 43 30 02 232
11 55 50 03 253
11 64 54 10 280
22 21 52 11 321
22 43 65 04 201
22 51 66 09 211
22 51 78 16 199
05 45 70 01 202
05 51 54 10 213
So I would want three models for the three cities here, based on temperature, humidity, and the time factor.
Upvotes: 4
Views: 850
Reputation: 13304
By using ddply, you apply lm to subsets of your data frame, where each subset corresponds to one city. It seems that some cities in the full data set have only one record. For such cases the statistical analysis is obviously meaningless. lm will still return some answer when all predictors are numeric, but if the model contains a factor variable, that factor has only one observed level within the subset, no contrasts can be built for it, and lm throws the error you see.
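To see the failure in isolation, here is a minimal sketch on a made-up one-city subset (the data frame `one_city` and its values are illustrative, not taken from the question). All of its rows share a single `time` value, which is the same situation a single-row city is in:

```r
# Hypothetical subset: one city whose records all share the same time level
one_city <- data.frame(
  temperature = c(45, 51, 48, 60),
  humidity    = c(70, 54, 60, 41),
  time        = factor(c("1", "1", "1", "1"), levels = c("1", "10")),
  output      = c(202, 213, 207, 230)
)

# Numeric predictors alone fit without complaint
num_fit <- coefficients(lm(output ~ temperature + humidity, data = one_city))

# Adding the factor fails: lm drops the unused level "10", leaving one level.
# tryCatch captures the error message instead of stopping the script.
res <- tryCatch(
  lm(output ~ temperature + humidity + time, data = one_city),
  error = function(e) conditionMessage(e)
)
print(res)
```

Note that the declared levels of the factor do not help here: what matters is how many levels are actually present in the rows passed to lm.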
As a workaround, you could check the number of rows inside your anonymous function:
ddply(d,'city',function(x) if (nrow(x)==1) return() else coefficients(lm(output~temperature+humidity+time, data=x)))
where d is a slightly modified version of your sample set, in which I changed the city id in the last row to make sure that some cities have only one record:
d <- structure(list(city = c(11, 11, 11, 11, 22, 22, 22, 22, 5, 7), temperature = c(51L, 43L, 55L, 64L, 21L, 43L, 51L, 51L, 45L, 51L), humidity = c(34L, 30L, 50L, 54L, 52L, 65L, 66L, 78L, 70L, 54L), time = structure(c(1L, 2L, 3L, 6L, 7L, 4L, 5L, 8L, 1L, 6L), .Label = c("1", "2", "3", "4", "9", "10", "11", "16"), class = "factor"), output = c(201L, 232L, 253L, 280L, 321L, 201L, 211L, 199L, 202L, 213L)), .Names = c("city", "temperature", "humidity", "time", "output"), row.names = c(NA, -10L), class = "data.frame")
You could also use this base R code instead of ddply:
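# one data frame per city
L <- split(d, d$city)
# fit a model per city, skipping cities with a single record
L2 <- lapply(L, function(x) {
  if (nrow(x) == 1)
    return()
  else
    coefficients(lm(output ~ temperature + humidity + time, data = x))
})
# stack the coefficient vectors into one table (NULL entries are ignored)
M <- do.call(rbind, L2)
df <- as.data.frame(M)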
This code is wordier, but the intermediate objects (L, L2, M) make it much easier to inspect and debug when something behaves unexpectedly.
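The row-count check does not catch a city that has several records all taken at the same time, which would raise the same contrasts error. One possible variation, sketched here on made-up data, is to test the number of observed time levels instead. The names `d2` and `fit_city` are hypothetical and used only for illustration:

```r
# Made-up data: city 5 has two rows but only one observed time level
d2 <- data.frame(
  city        = c(11, 11, 11, 11, 5, 5),
  temperature = c(51, 43, 55, 64, 45, 51),
  humidity    = c(34, 30, 50, 54, 70, 54),
  time        = factor(c("1", "2", "1", "2", "1", "1")),
  output      = c(201, 232, 253, 280, 202, 213)
)

# Hypothetical helper: skip subsets whose time factor has < 2 observed levels
fit_city <- function(x) {
  if (nlevels(droplevels(x$time)) < 2) return(NULL)
  coefficients(lm(output ~ temperature + humidity + time, data = x))
}

fits <- lapply(split(d2, d2$city), fit_city)

# Cities that were skipped, so they can be reviewed rather than silently lost
skipped <- names(fits)[vapply(fits, is.null, logical(1))]
print(skipped)
```

Listing the skipped cities makes it clear which of the 100 models are missing and why, instead of only noticing fewer rows in the result.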
Upvotes: 5