Reputation: 313
I am using ctree() from the party package in R. I want to be able to use columns from more than one dataframe, calling each column separately (using $) as I have in the past with this function, but this time it is not working.
For the purposes of illustrating the error, I've put together a sample data set as a single dataframe. When I run:
ctree(data$adult_age ~ data$child_age + data$freq)
I get the following error:
Error in model.frame.default(formula = ~data$adult_age, data = list(), :
  invalid type (NULL) for variable 'data$adult_age'
If I run it like this, it works:
ctree(adult_age ~ child_age + freq, data)
Usually those two ways of writing it out are interchangeable (e.g. with lm() I get the same results with both), but with ctree() I am running into an error. Why? How can I fix this so that I can pull from different dataframes at once without combining them?
My data structure looks like this:
> dput(data)
structure(list(adult_age = c(38, 38, 38, 38, 38, 55.5, 55.5, 38, 38, 38), child_age = c(8, 8, 13, 3.5, 3.5, 13, 8, 8, 8, 13), freq = c(0.1, 12, 0.1, 0.1, 0.1, 0.1, 1, 2, 0.1, 0.1)), .Names = c("adult_age", "child_age", "freq"), class = "data.frame", row.names = c(12L, 13L, 14L, 15L, 18L, 20L, 22L, 23L, 24L, 25L))
If you want to run the sample data:
adult_age = c(38, 38, 38, 38, 38, 55.5, 55.5, 38, 38, 38)
child_age = c(8, 8, 13, 3.5, 3.5, 13, 8, 8, 8, 13)
freq = c(0.1, 12, 0.1, 0.1, 0.1, 0.1, 1, 2, 0.1, 0.1)
data = as.data.frame(cbind(adult_age, child_age, freq))
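(Equivalently, the same frame can be built with a direct data.frame() call, which skips the intermediate matrix that cbind() creates:)
data = data.frame(adult_age, child_age, freq)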
Upvotes: 2
Views: 6880
Reputation: 17168
Never use data$ inside model formulas (as already pointed out by @Roland). Apart from the fact that you unnecessarily repeat the data name and have to type more, it is a source of confusion and errors. If you haven't encountered this problem with lm() yet, then you haven't used predict(). Consider a simple linear regression for your data:
m1 <- lm(adult_age ~ child_age, data = data)
m2 <- lm(data$adult_age ~ data$child_age)
coef(m1) - coef(m2)
## (Intercept) child_age
## 0 0
Thus, both approaches lead to the same coefficient estimates etc. But in all situations where you want to use the same formula with different/updated/subsetted data, you run into trouble, most prominently in predict(), e.g., when making a prediction at child_age = 0. The intended usage, with formula and data separated, correctly recovers the intercept:
predict(m1, newdata = data.frame(child_age = 0))
## 1
## 36.38919
coef(m1)[1]
## (Intercept)
## 36.38919
But for the data$ version, the newdata argument is not used at all in the actual prediction:
predict(m2, newdata = data.frame(child_age = 0))
## 1 2 3 4 5 6 7 8
## 41.14343 41.14343 44.11483 38.46917 38.46917 44.11483 41.14343 41.14343
## 9 10
## 41.14343 44.11483
## Warning message:
## 'newdata' had 1 row but variables found have 10 rows
There are more examples like this, but this one should be serious enough to make you refrain from the practice.
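For instance, the same pitfall appears when you split the data for out-of-sample prediction (a minimal sketch; the train/test split below is arbitrary and only for illustration):
train <- data[1:7, ]
test <- data[8:10, ]
m_train <- lm(adult_age ~ child_age, data = train)
predict(m_train, newdata = test)   # three predictions, one per test row
m_bad <- lm(train$adult_age ~ train$child_age)
predict(m_bad, newdata = test)     # 'test' is ignored: returns the seven
                                   # training fits plus an analogous warning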
ctree()
If you are determined to shoot yourself in the foot with the data$ approach, you can do so with the new (and recommended) implementation of ctree() in the partykit package. The whole formula/data handling was rewritten, using standard nonstandard evaluation.
library("partykit")
ctree(adult_age ~ child_age + freq, data = data)
## Model formula:
## adult_age ~ child_age + freq
##
## Fitted party:
## [1] root: 41.500 (n = 10, err = 490.0)
##
## Number of inner nodes: 0
## Number of terminal nodes: 1
ctree(data$adult_age ~ data$child_age + data$freq)
## Model formula:
## data$adult_age ~ data$child_age + data$freq
##
## Fitted party:
## [1] root: 41.500 (n = 10, err = 490.0)
##
## Number of inner nodes: 0
## Number of terminal nodes: 1
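And with the formula-plus-data form you get the same payoff as with lm(): the fitted tree can be applied to new observations via predict() (a minimal sketch; the new values for child_age and freq are made up, and with only a root node the tree simply returns the overall mean):
ct <- ctree(adult_age ~ child_age + freq, data = data)
predict(ct, newdata = data.frame(child_age = 10, freq = 0.5))  # 41.5, the root mean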
Upvotes: 3