kl-higgins

Reputation: 313

How to fix "invalid type (NULL) for variable" error in ctree function using the party package in R?

I am using ctree() from the party package in R. I want to be able to use columns from more than one dataframe, so I refer to each column separately (using $), as I have done in the past with this function, but this time it is not working.

For the purposes of illustrating the error, I've put together a sample data set as a single dataframe. When I run:

>ctree(data$adult_age~data$child_age+data$freq)

I get the following error:

>Error in model.frame.default(formula = ~data$adult_age, data = list(),  : 
  invalid type (NULL) for variable 'data$adult_age'

If I run it like this, it works:

>ctree(adult_age~child_age+freq, data)

Usually those two ways of writing it out are interchangeable (e.g. with lm() I get the same results with both), but with ctree() I am running into an error. Why? How can I fix this so that I can pull from different dataframes at once without combining them?

My data structure looks like this:

> dput(data)

>structure(list(adult_age = c(38, 38, 38, 38, 38, 55.5, 55.5, 38, 38, 38), child_age = c(8, 8, 13, 3.5, 3.5, 13, 8, 8, 8, 13), freq = c(0.1, 12, 0.1, 0.1, 0.1, 0.1, 1, 2, 0.1, 0.1)), .Names = c("adult_age", "child_age", "freq"), class = "data.frame", row.names = c(12L, 13L, 14L, 15L, 18L, 20L, 22L, 23L, 24L, 25L))

If you want to run the sample data:

>adult_age = c(38, 38, 38, 38, 38, 55.5, 55.5, 38, 38, 38)

>child_age = c(8, 8, 13, 3.5, 3.5, 13, 8, 8, 8, 13)

>freq = c(0.1, 12, 0.1, 0.1, 0.1, 0.1, 1, 2, 0.1, 0.1)

>data=as.data.frame(cbind(adult_age, child_age, freq))

Upvotes: 2

Views: 6880

Answers (1)

Achim Zeileis

Reputation: 17168

Why this approach should not be applied

Never use data$ inside model formulas (as already pointed out by @Roland). Besides needlessly repeating the data name, it is a source of confusion and errors. If you haven't run into this problem with lm() yet, then you haven't used predict(). Consider a simple linear regression for your data:

m1 <- lm(adult_age ~ child_age, data = data)
m2 <- lm(data$adult_age ~ data$child_age)
coef(m1) - coef(m2)
## (Intercept)   child_age 
##           0           0 

Thus, both approaches lead to the same coefficient estimates. But whenever you want to use the same formula with different, updated, or subsetted data, you run into trouble. A prominent case is predict(), e.g., when making a prediction at child_age = 0. The intended usage, with formula and data kept separate, correctly recovers the intercept:

predict(m1, newdata = data.frame(child_age = 0))
##        1 
## 36.38919 
coef(m1)[1]
## (Intercept) 
##    36.38919 

But for the data$ version the newdata is not used at all in the actual prediction:

predict(m2, newdata = data.frame(child_age = 0))
##        1        2        3        4        5        6        7        8 
## 41.14343 41.14343 44.11483 38.46917 38.46917 44.11483 41.14343 41.14343 
##        9       10 
## 41.14343 44.11483 
## Warning message:
## 'newdata' had 1 row but variables found have 10 rows 

There are more examples like this, but this one should be serious enough to make you refrain from using data$ in formulas.
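One further case, sketched here in base R, is refitting on a subset with update(). The subset condition freq < 1 and the object names sub, m1_sub and m2_sub are illustrative assumptions, not part of the question. The point is only that the data$ model silently keeps using the full data:

```r
# Sample data from the question
data <- data.frame(
  adult_age = c(38, 38, 38, 38, 38, 55.5, 55.5, 38, 38, 38),
  child_age = c(8, 8, 13, 3.5, 3.5, 13, 8, 8, 8, 13),
  freq      = c(0.1, 12, 0.1, 0.1, 0.1, 0.1, 1, 2, 0.1, 0.1)
)

m1 <- lm(adult_age ~ child_age, data = data)
m2 <- lm(data$adult_age ~ data$child_age)

# Illustrative subset (assumption): keep only rows with freq < 1
sub <- data[data$freq < 1, ]

# Refit both models, asking for the subset as the data
m1_sub <- update(m1, data = sub)
m2_sub <- update(m2, data = sub)

nobs(m1_sub)  # equals nrow(sub): the subset is actually used
nobs(m2_sub)  # equals nrow(data): data$... bypasses 'data = sub'
```

The second refit raises no error, which is what makes this kind of mistake easy to miss.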

How it can be applied to ctree()

If you are determined to shoot yourself in the foot with the data$ approach, you can do so with the new (and recommended) implementation of ctree() in the partykit package. The whole formula/data handling was rewritten, using standard nonstandard evaluation.

library("partykit")
ctree(adult_age ~ child_age + freq, data = data)
## Model formula:
## adult_age ~ child_age + freq
## 
## Fitted party:
## [1] root: 41.500 (n = 10, err = 490.0) 
## 
## Number of inner nodes:    0
## Number of terminal nodes: 1
ctree(data$adult_age ~ data$child_age + data$freq)
## Model formula:
## data$adult_age ~ data$child_age + data$freq
## 
## Fitted party:
## [1] root: 41.500 (n = 10, err = 490.0) 
## 
## Number of inner nodes:    0
## Number of terminal nodes: 1
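
For the original goal of using columns that live in different data frames, one alternative to data$ is to assemble a small temporary data frame just for the model call. This is a minimal sketch, not the only way to do it: the data frames adults and children and the name model_data are hypothetical, and it assumes both data frames have the same number of rows in the same order.

```r
# Hypothetical split of the sample data into two data frames with aligned rows
adults <- data.frame(
  adult_age = c(38, 38, 38, 38, 38, 55.5, 55.5, 38, 38, 38)
)
children <- data.frame(
  child_age = c(8, 8, 13, 3.5, 3.5, 13, 8, 8, 8, 13),
  freq      = c(0.1, 12, 0.1, 0.1, 0.1, 0.1, 1, 2, 0.1, 0.1)
)

# Assumption: row i describes the same case in both data frames
stopifnot(nrow(adults) == nrow(children))

# Collect only the columns the model needs into one data frame
model_data <- data.frame(
  adult_age = adults$adult_age,
  child_age = children$child_age,
  freq      = children$freq
)

# Fit with formula and data kept separate (skipped if partykit is not installed)
if (requireNamespace("partykit", quietly = TRUE)) {
  tree <- partykit::ctree(adult_age ~ child_age + freq, data = model_data)
  print(tree)
}
```

Because the formula contains plain variable names, the fitted tree can later be used with predict() on new data that has the same column names.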

Upvotes: 3
