Reputation: 6312
(Not sure whether this question belongs to CrossValidated or Stackoverflow)
Subset of my data:
mdat1 <- structure(list(Name = structure(c(3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("Bilbao",
"San Sebastian", "Vitoria"), class = "factor"), PrecipTotal = c(0,
1.01600203200406, 0, 6.09601219202438, 73.4061468122936, 4.31800863601727,
0, 0.254000508001016, 7.8740157480315, 5.58801117602235, 0, 0,
0, 0, 2.03200406400813, 0, 0.254000508001016, 0, 2.03200406400813,
0, 0, 0, 57.9121158242316, 1.77800355600711, 0, 0.762001524003048,
6.3500127000254, 0, 0, 1.27000254000508, 8.89001778003556, 1.01600203200406,
0, 0, 0, 0, 0.762001524003048, 0, 8.89001778003556, 0, 0, 21.8440436880874,
0, 0.508001016002032, 0, 0.508001016002032, 0.508001016002032,
0, 0, 0, 14.4780289560579, 0.254000508001016, 0.508001016002032,
0, 23.3680467360935, 6.09601219202438, 0, 0, 0, 0, 28.1940563881128,
0, 0, 0, 3.04800609601219, 0, 0, 0, 0, 6.09601219202438, 0, 2.03200406400813,
0, 4.06400812801626, 0, 0.508001016002032, 0, 0, 0.508001016002032,
7.11201422402845, 34.0360680721361, 0, 0, 0, 7.8740157480315,
0, 4.06400812801626, 0, 0, 0.508001016002032, 5.08001016002032,
7.11201422402845, 7.11201422402845, 0, 0, 0, 1.01600203200406,
0, 0, 0), Hail = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L,
2L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Hail",
"NoHail"), class = "factor")), .Names = c("Name", "PrecipTotal",
"Hail"), row.names = c(43878L, 33821L, 40681L, 35121L, 45112L,
46428L, 45844L, 43199L, 34440L, 43184L, 32850L, 39220L, 38416L,
33860L, 34867L, 32737L, 43232L, 31772L, 35850L, 38894L, 39289L,
33148L, 32159L, 43197L, 43962L, 45068L, 41848L, 35929L, 34842L,
42069L, 39503L, 31747L, 43286L, 34919L, 43925L, 45368L, 42489L,
41686L, 43194L, 34747L, 37001L, 42923L, 45006L, 46170L, 33191L,
34392L, 44047L, 35859L, 42159L, 38843L, 45860L, 34180L, 33846L,
42810L, 46160L, 33523L, 34840L, 40226L, 42868L, 43576L, 46570L,
39980L, 42453L, 42063L, 38121L, 32822L, 40670L, 32859L, 46228L,
40239L, 32420L, 38874L, 39638L, 39523L, 31765L, 32753L, 33752L,
35574L, 36263L, 32871L, 32539L, 38455L, 41119L, 45124L, 34560L,
34144L, 41461L, 41449L, 35499L, 42783L, 34106L, 38151L, 36313L,
46593L, 39973L, 43928L, 35240L, 43626L, 46195L, 44388L), class = "data.frame")
Using the following code
cdplot(mdat1 [, 2], mdat1 [, 3], ylab = "", main = "1",
xlab = "",
col = c("purple", "gray"))
creates a messed up output ("1") of cdplot()
. Using a different sample of my original data produces the output labeled with "2"
I assume that it has something to do with the distribution of the x-values? If they are extremely skewed (like for "1"), the density calculation gets in trouble?
Upvotes: 1
Views: 427
Reputation: 263481
I'd say it was simply a bug, albeit one that you are warned rather vaguely about when the help page says "conditional densities are more reliable for high-density regions of x". Contrast all these efforts with the result you get with lattice's densityplot
. (Much cleaner and informative in my opinion.) The cdplot
and ggplot
efforts appear to seriously distort the data.
library(lattice)
densityplot(~PrecipTotal, groups=Hail, mdat1, col = c("purple", "gray"))
You can contrast that disply of the data with the output from that less pathological appearance you get from:
cdplot(Hail ~ PrecipTotal, data=mdat1, bw=2)
... but that still leaves you with the impression that there is substantial difference in the densities of the two groups in the region of 45-65 whereas the side-by-side display should you htat there is a gap in one and a single point in the other group which seems far more easily explained by random variation.
There is the fine point to be made that the lattice plotting argument convention is that separate plots result from a formula specification that includes the grouping variable, whereas bringing the grouping in with the groups=
mechanism includes them in the same plot region.
Upvotes: 2
Reputation: 215127
Here is how it looks when I simply adjust the bw
parameter without modifying your data, so I would say just play with the bw
parameter.
cdplot(mdat1 [, 2], mdat1 [, 3], ylab = "",
xlab = "",
col = c("purple", "gray"), bw = 1)
cdplot(mdat1 [, 2], mdat1 [, 3], ylab = "",
xlab = "",
col = c("purple", "gray"), bw = 2)
Upvotes: 2
Reputation: 10781
I think you might want to consider transforming your PrecipTotal
variable first, then create a conditional density plot. After monkeying around a bit, it seems like taking the sqrt
of the variable might be sufficient. We may also need to adjust the binwidth
to get a better looking plot.
Obviously, these transformations and adjustment require us to be very careful about our interpretation of the relationship.
R
using cdplot
cdplot(Hail ~ sqrt(PrecipTotal), data = mdat1)
ggplot2
using geom_density
and position = 'fill'
library(ggplot2)
ggplot(mdat1, aes(sqrt(PrecipTotal)))+
geom_density(aes(fill = Hail), position = 'fill')+
theme_bw()
ggplot2
with some optionsggplot(mdat1, aes(sqrt(PrecipTotal)))+
geom_density(aes(fill = Hail), position = 'fill',
kernel = 'cosine', adjust = 1.1)+
theme_bw()
Upvotes: 1