Reputation: 6312

Problems with density calculation of cdplot() in R

(Not sure whether this question belongs to CrossValidated or Stackoverflow)

Subset of my data:

mdat1 <- structure(list(Name = structure(c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("Bilbao", 
"San Sebastian", "Vitoria"), class = "factor"), PrecipTotal = c(0, 
1.01600203200406, 0, 6.09601219202438, 73.4061468122936, 4.31800863601727, 
0, 0.254000508001016, 7.8740157480315, 5.58801117602235, 0, 0, 
0, 0, 2.03200406400813, 0, 0.254000508001016, 0, 2.03200406400813, 
0, 0, 0, 57.9121158242316, 1.77800355600711, 0, 0.762001524003048, 
6.3500127000254, 0, 0, 1.27000254000508, 8.89001778003556, 1.01600203200406, 
0, 0, 0, 0, 0.762001524003048, 0, 8.89001778003556, 0, 0, 21.8440436880874, 
0, 0.508001016002032, 0, 0.508001016002032, 0.508001016002032, 
0, 0, 0, 14.4780289560579, 0.254000508001016, 0.508001016002032, 
0, 23.3680467360935, 6.09601219202438, 0, 0, 0, 0, 28.1940563881128, 
0, 0, 0, 3.04800609601219, 0, 0, 0, 0, 6.09601219202438, 0, 2.03200406400813, 
0, 4.06400812801626, 0, 0.508001016002032, 0, 0, 0.508001016002032, 
7.11201422402845, 34.0360680721361, 0, 0, 0, 7.8740157480315, 
0, 4.06400812801626, 0, 0, 0.508001016002032, 5.08001016002032, 
7.11201422402845, 7.11201422402845, 0, 0, 0, 1.01600203200406, 
0, 0, 0), Hail = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 
2L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Hail", 
"NoHail"), class = "factor")), .Names = c("Name", "PrecipTotal", 
"Hail"), row.names = c(43878L, 33821L, 40681L, 35121L, 45112L, 
46428L, 45844L, 43199L, 34440L, 43184L, 32850L, 39220L, 38416L, 
33860L, 34867L, 32737L, 43232L, 31772L, 35850L, 38894L, 39289L, 
33148L, 32159L, 43197L, 43962L, 45068L, 41848L, 35929L, 34842L, 
42069L, 39503L, 31747L, 43286L, 34919L, 43925L, 45368L, 42489L, 
41686L, 43194L, 34747L, 37001L, 42923L, 45006L, 46170L, 33191L, 
34392L, 44047L, 35859L, 42159L, 38843L, 45860L, 34180L, 33846L, 
42810L, 46160L, 33523L, 34840L, 40226L, 42868L, 43576L, 46570L, 
39980L, 42453L, 42063L, 38121L, 32822L, 40670L, 32859L, 46228L, 
40239L, 32420L, 38874L, 39638L, 39523L, 31765L, 32753L, 33752L, 
35574L, 36263L, 32871L, 32539L, 38455L, 41119L, 45124L, 34560L, 
34144L, 41461L, 41449L, 35499L, 42783L, 34106L, 38151L, 36313L, 
46593L, 39973L, 43928L, 35240L, 43626L, 46195L, 44388L), class = "data.frame")

Using the following code

cdplot(mdat1 [, 2], mdat1 [, 3], ylab = "", main = "1", 
                 xlab = "", 
                 col = c("purple", "gray"))

creates a messed up output ("1") of cdplot(). Using a different sample of my original data produces the output labeled with "2"

I assume that it has something to do with the distribution of the x-values? If they are extremely skewed (like for "1"), the density calculation gets in trouble?

Upvotes: 1

Answers (3)

IRTFM

Reputation: 263481

I'd say it was simply a bug, albeit one that you are warned rather vaguely about when the help page says "conditional densities are more reliable for high-density regions of x". Contrast all these efforts with the result you get with lattice's densityplot. (Much cleaner and informative in my opinion.) The cdplot and ggplot efforts appear to seriously distort the data.

library(lattice)
densityplot(~PrecipTotal, groups=Hail, mdat1, col = c("purple", "gray"))

You can contrast that disply of the data with the output from that less pathological appearance you get from:

cdplot(Hail ~ PrecipTotal, data=mdat1, bw=2)

... but that still leaves you with the impression that there is substantial difference in the densities of the two groups in the region of 45-65 whereas the side-by-side display should you htat there is a gap in one and a single point in the other group which seems far more easily explained by random variation.

There is the fine point to be made that the lattice plotting argument convention is that separate plots result from a formula specification that includes the grouping variable, whereas bringing the grouping in with the groups= mechanism includes them in the same plot region.

Upvotes: 2

akuiper

Reputation: 215127

Here is how it looks when I simply adjust the bw parameter without modifying your data, so I would say just play with the bw parameter.

cdplot(mdat1 [, 2], mdat1 [, 3], ylab = "", 
              xlab = "", 
              col = c("purple", "gray"), bw = 1)

cdplot(mdat1 [, 2], mdat1 [, 3], ylab = "", 
              xlab = "", 
              col = c("purple", "gray"), bw = 2)

Upvotes: 2

bouncyball

Reputation: 10781

I think you might want to consider transforming your PrecipTotal variable first, then create a conditional density plot. After monkeying around a bit, it seems like taking the sqrt of the variable might be sufficient. We may also need to adjust the binwidth to get a better looking plot.

Obviously, these transformations and adjustment require us to be very careful about our interpretation of the relationship.

Base `R` using `cdplot`

cdplot(Hail ~ sqrt(PrecipTotal), data = mdat1)

`ggplot2` using `geom_density` and `position = 'fill'`

library(ggplot2)
ggplot(mdat1, aes(sqrt(PrecipTotal)))+
    geom_density(aes(fill = Hail), position = 'fill')+
    theme_bw()

`ggplot2` with some options

ggplot(mdat1, aes(sqrt(PrecipTotal)))+
    geom_density(aes(fill = Hail), position = 'fill',
                 kernel = 'cosine', adjust = 1.1)+
    theme_bw()

Upvotes: 1

Problems with density calculation of cdplot() in R

Answers (3)

Base R using cdplot

ggplot2 using geom_density and position = 'fill'

ggplot2 with some options

Related Questions

Base `R` using `cdplot`

`ggplot2` using `geom_density` and `position = 'fill'`

`ggplot2` with some options