pat-s
pat-s

Reputation: 6312

Problems with density calculation of cdplot() in R

(Not sure whether this question belongs to CrossValidated or Stackoverflow)

Subset of my data:

mdat1 <- structure(list(Name = structure(c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("Bilbao", 
"San Sebastian", "Vitoria"), class = "factor"), PrecipTotal = c(0, 
1.01600203200406, 0, 6.09601219202438, 73.4061468122936, 4.31800863601727, 
0, 0.254000508001016, 7.8740157480315, 5.58801117602235, 0, 0, 
0, 0, 2.03200406400813, 0, 0.254000508001016, 0, 2.03200406400813, 
0, 0, 0, 57.9121158242316, 1.77800355600711, 0, 0.762001524003048, 
6.3500127000254, 0, 0, 1.27000254000508, 8.89001778003556, 1.01600203200406, 
0, 0, 0, 0, 0.762001524003048, 0, 8.89001778003556, 0, 0, 21.8440436880874, 
0, 0.508001016002032, 0, 0.508001016002032, 0.508001016002032, 
0, 0, 0, 14.4780289560579, 0.254000508001016, 0.508001016002032, 
0, 23.3680467360935, 6.09601219202438, 0, 0, 0, 0, 28.1940563881128, 
0, 0, 0, 3.04800609601219, 0, 0, 0, 0, 6.09601219202438, 0, 2.03200406400813, 
0, 4.06400812801626, 0, 0.508001016002032, 0, 0, 0.508001016002032, 
7.11201422402845, 34.0360680721361, 0, 0, 0, 7.8740157480315, 
0, 4.06400812801626, 0, 0, 0.508001016002032, 5.08001016002032, 
7.11201422402845, 7.11201422402845, 0, 0, 0, 1.01600203200406, 
0, 0, 0), Hail = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 
2L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Hail", 
"NoHail"), class = "factor")), .Names = c("Name", "PrecipTotal", 
"Hail"), row.names = c(43878L, 33821L, 40681L, 35121L, 45112L, 
46428L, 45844L, 43199L, 34440L, 43184L, 32850L, 39220L, 38416L, 
33860L, 34867L, 32737L, 43232L, 31772L, 35850L, 38894L, 39289L, 
33148L, 32159L, 43197L, 43962L, 45068L, 41848L, 35929L, 34842L, 
42069L, 39503L, 31747L, 43286L, 34919L, 43925L, 45368L, 42489L, 
41686L, 43194L, 34747L, 37001L, 42923L, 45006L, 46170L, 33191L, 
34392L, 44047L, 35859L, 42159L, 38843L, 45860L, 34180L, 33846L, 
42810L, 46160L, 33523L, 34840L, 40226L, 42868L, 43576L, 46570L, 
39980L, 42453L, 42063L, 38121L, 32822L, 40670L, 32859L, 46228L, 
40239L, 32420L, 38874L, 39638L, 39523L, 31765L, 32753L, 33752L, 
35574L, 36263L, 32871L, 32539L, 38455L, 41119L, 45124L, 34560L, 
34144L, 41461L, 41449L, 35499L, 42783L, 34106L, 38151L, 36313L, 
46593L, 39973L, 43928L, 35240L, 43626L, 46195L, 44388L), class = "data.frame")

Using the following code

cdplot(mdat1 [, 2], mdat1 [, 3], ylab = "", main = "1", 
                 xlab = "", 
                 col = c("purple", "gray"))

creates a messed up output ("1") of cdplot(). Using a different sample of my original data produces the output labeled with "2"

enter image description here

I assume that it has something to do with the distribution of the x-values? If they are extremely skewed (like for "1"), the density calculation gets in trouble?

enter image description here

Upvotes: 1

Views: 427

Answers (3)

IRTFM
IRTFM

Reputation: 263481

I'd say it was simply a bug, albeit one that you are warned rather vaguely about when the help page says "conditional densities are more reliable for high-density regions of x". Contrast all these efforts with the result you get with lattice's densityplot. (Much cleaner and informative in my opinion.) The cdplot and ggplot efforts appear to seriously distort the data.

library(lattice)
densityplot(~PrecipTotal, groups=Hail, mdat1, col = c("purple", "gray"))

You can contrast that disply of the data with the output from that less pathological appearance you get from:

cdplot(Hail ~ PrecipTotal, data=mdat1, bw=2)

... but that still leaves you with the impression that there is substantial difference in the densities of the two groups in the region of 45-65 whereas the side-by-side display should you htat there is a gap in one and a single point in the other group which seems far more easily explained by random variation.

enter image description here

There is the fine point to be made that the lattice plotting argument convention is that separate plots result from a formula specification that includes the grouping variable, whereas bringing the grouping in with the groups= mechanism includes them in the same plot region.

Upvotes: 2

akuiper
akuiper

Reputation: 215127

Here is how it looks when I simply adjust the bw parameter without modifying your data, so I would say just play with the bw parameter.

cdplot(mdat1 [, 2], mdat1 [, 3], ylab = "", 
              xlab = "", 
              col = c("purple", "gray"), bw = 1)

enter image description here

cdplot(mdat1 [, 2], mdat1 [, 3], ylab = "", 
              xlab = "", 
              col = c("purple", "gray"), bw = 2)

enter image description here

Upvotes: 2

bouncyball
bouncyball

Reputation: 10781

I think you might want to consider transforming your PrecipTotal variable first, then create a conditional density plot. After monkeying around a bit, it seems like taking the sqrt of the variable might be sufficient. We may also need to adjust the binwidth to get a better looking plot.

Obviously, these transformations and adjustment require us to be very careful about our interpretation of the relationship.

Base R using cdplot

cdplot(Hail ~ sqrt(PrecipTotal), data = mdat1)

enter image description here

ggplot2 using geom_density and position = 'fill'

library(ggplot2)
ggplot(mdat1, aes(sqrt(PrecipTotal)))+
    geom_density(aes(fill = Hail), position = 'fill')+
    theme_bw()

enter image description here

ggplot2 with some options

ggplot(mdat1, aes(sqrt(PrecipTotal)))+
    geom_density(aes(fill = Hail), position = 'fill',
                 kernel = 'cosine', adjust = 1.1)+
    theme_bw()

enter image description here

Upvotes: 1

Related Questions