Reputation: 5465
I'm using the lmtree() function from partykit to partition data using linear regressions. The regressions use a weight, and I want to ensure that each branch has a minimum total weight, which I specify with the minsize option. For instance, in the following example the tree only has two branches instead of three because x1=="C" has too small a total weight to be in its own branch.
library("partykit")

n <- 100
X <- rbind(
  data.frame(TT=1:n, x1="A", weight=2, y=seq(1,l=n,by=0.2)+rnorm(n,sd=.2)),
  data.frame(TT=1:n, x1="B", weight=2, y=seq(1,l=n,by=0.4)+rnorm(n,sd=.2)),
  data.frame(TT=1:n, x1="C", weight=1, y=seq(1,l=n,by=0.6)+rnorm(n,sd=.2))
)
X$x1 <- factor(X$x1)
tr <- lmtree(y ~ TT | x1, data=X, weights=weight, minsize=150)
Fitted party:
[1] root
| [2] x1 in A: n = 200
| (Intercept) TT
| 0.7724903 0.2002023
| [3] x1 in B, C: n = 300
| (Intercept) TT
| 0.5759213 0.4659592
I also have some real-world data that unfortunately is confidential, but it leads to behavior that I do not understand. When I do not specify minsize, lmtree() builds a tree with 30 branches, where in every branch the total weight n is large. However, when I specify a minsize that is well below the total weight of every branch from this first tree, the result is a new tree with many fewer branches. I would not have expected the tree to change at all, because it seems that minsize is not binding. Is there any explanation for this result?
UPDATE
Providing an example:
n <- 100
X <- rbind(
  data.frame(TT=1:n, x1=runif(n, 0.0, 0.3), weight=2, y=seq(1,l=n,by=0.2)+rnorm(n,sd=.2)),
  data.frame(TT=1:n, x1=runif(n, 0.3, 0.7), weight=2, y=seq(1,l=n,by=0.4)+rnorm(n,sd=.2)),
  data.frame(TT=1:n, x1=runif(n, 0.7, 1.0), weight=1, y=seq(1,l=n,by=0.6)+rnorm(n,sd=.2))
)
tr <- lmtree(y ~ TT | x1, data=X, weights = weight)
Fitted party:
[1] root
| [2] x1 <= 0.29787: n = 200
| (Intercept) TT
| 0.8431985 0.1994021
| [3] x1 > 0.29787
| | [4] x1 <= 0.69515: n = 200
| | (Intercept) TT
| | 0.6346980 0.3995678
| | [5] x1 > 0.69515: n = 100
| | (Intercept) TT
| | 0.4792462 0.5987472
Now let's set minsize=150. The tree no longer has any splits, even though splitting at x1 <= 0.3 and x1 > 0.3 would work; a quick check of the weight totals below confirms this.
tr <- lmtree(y ~ TT | x1, data=X, weights = weight, minsize=150)
Fitted party:
[1] root: n = 500
(Intercept) TT
0.6870078 0.3593374
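To double-check that claim, we can sum the case weights on each side of a split at 0.3 (the totals follow from how X was constructed above): both sides have total weight of at least 150, so the split should be admissible.

sum(X$weight[X$x1 <= 0.3])
## [1] 200
sum(X$weight[X$x1 > 0.3])
## [1] 300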
Upvotes: 3
Views: 253
Reputation: 17203
Two rules applied in mob() (the infrastructure underlying lmtree()) are important in this context and may benefit from more explicit discussion:

1. If mob() selects a splitting variable at any stage that then does not lead to a single admissible split (in terms of minimal node size), then splitting stops at that point. This is in contrast to ctree(), which always performs a split if a significant test was detected, even if the second-best variable was non-significant. It would probably be good to offer more granular control over this, and we have it on our wishlist for the upcoming revision of the package.

2. By default the weights are interpreted as case weights, i.e., mob() acts as if there were w independent observations identical to the given one. Thus, the number of observations is the sum of the weights. But note that this also affects the significance tests, for which the sample size increases! (A quick sketch of this follows below.)
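As a small illustration of the case-weight interpretation, consider the example data X from your question (the numbers follow directly from that construction): the node sizes n printed in the tree sum to the total case weight, not the number of rows.

nrow(X)        ## 300 rows in the data
sum(X$weight)  ## 500 = sum of the node sizes (200 + 300) in the first tree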
As for your main question: it's hard to come up with an explanation without any reproducible example. I agree that partykit should behave in the way you describe it, but maybe there is one important but not so obvious detail that you haven't noticed yet... It would be good if you could come up with a small/simple artificial data set that replicates the problem.
As already pointed out in the comments: thanks for the reproducible example in your updated question. This helped me track down a bug in mob() in handling case weights. There was an error in the computation of the test statistic in the presence of case weights, leading to incorrect split variable selection and an incorrect stopping criterion. I have just fixed this bug, and the new partykit development version is available from R-Forge at https://r-forge.r-project.org/R/?group_id=261. (Note, however, that R-Forge at the moment only builds Windows binaries for R 3.3.x. If a more recent Windows version is used, please use type = "source" to install the source package, and make sure you have the necessary Rtools installed.)
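For reference, installation from R-Forge can be done along these lines (a sketch; type = "source" is only needed in the situation described above and requires Rtools on Windows):

install.packages("partykit", repos = "http://R-Forge.R-project.org", type = "source")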
In your example I just set a random seed for exact reproducibility. The weighted data is set up as:
set.seed(1)
n <- 100
X <- rbind(
  data.frame(TT=1:n, x1=runif(n, 0.0, 0.3), weight=2, y=seq(1,l=n,by=0.2)+rnorm(n,sd=.2)),
  data.frame(TT=1:n, x1=runif(n, 0.3, 0.7), weight=2, y=seq(1,l=n,by=0.4)+rnorm(n,sd=.2)),
  data.frame(TT=1:n, x1=runif(n, 0.7, 1.0), weight=1, y=seq(1,l=n,by=0.6)+rnorm(n,sd=.2))
)
Then the weighted tree can be fitted as before. In this particular example the tree structure remains unaffected, but the test statistics and p-values of the parameter instability test in each node change somewhat:
library("partykit")
tr1 <- lmtree(y ~ TT | x1, data = X, weights = weight)
plot(tr1)
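To inspect those test statistics and p-values directly, rather than from the plot, sctest() from strucchange (also used further below) can be applied to the fitted tree, e.g., for the root node:

library("strucchange")
sctest(tr1, node = 1)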
Adding the minsize = 150 argument now has the expected effect of just avoiding the split in node 3.
tr2 <- lmtree(y ~ TT | x1, data = X, weights = weight, minsize = 150)
plot(tr2)
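A quick way to confirm this (width() counts the terminal nodes of a party tree): the number of leaves drops from three to two, because the split of node 3 would have produced a node with n = 100 < 150.

width(tr1)
## [1] 3
width(tr2)
## [1] 2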
To check that the latter actually does the right thing, we compare it with the tree for the explicitly expanded data. As the weights are regarded as case weights here, we can inflate the data set by repeating those observations with weights greater than 1.
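## replicate each row according to its case weight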
Xw <- X[rep(1:nrow(X), X$weight), ]
tr3 <- lmtree(y ~ TT | x1, data = Xw, minsize = 150)
The resulting coefficients are the same (up to very small numerical differences):
all.equal(coef(tr2), coef(tr3))
## [1] TRUE
And, more importantly, all test statistics and p-values in the nodes are also the same:
library("strucchange")
all.equal(sctest(tr2), sctest(tr3))
## [1] TRUE
Upvotes: 1