Reputation: 1437
Does plyr skip missing levels of a factor [that is the grouping variable]? It's the first of my questions in diagnosing a problem.
I have a dataset where patients are in strata=rural
or strata=city
. I want to compare the age
in treatment=A
to treatment=B
.
For example, I am trying to do:
ddply(data.c, .(strata), function(x) t.test(age~treatment, data=x, na.rm=TRUE)
But it tells me
Error in t.test.formula(age ~ treatment, data = x, na.rm = TRUE) : grouping factor must have exactly 2 levels
If I run factor(data$strata)
and factor(data$treatment)
I see only two levels, respectively (they are two labels, respectively; that's not a problem, right?).
Does plyr think that NA is a level in the grouping factor? What might be the problem for the error message?
I've been Googling and looking on stackoverflow to answer the question. I'm pretty new to R but I could not find an answer. Any help is much appreciated.
So I have some example code. Chase's example worked nicely but when I use my data, I still get the same error.
dlply(data.c, .(strata), function(x) t.test(age~treatment, data=x, na.rm=TRUE
produced the same error.
I don't understand why this is.
Below is my data.c
strata age treatment
1 1 57.8630136986301 p_3
2 0 52.958904109589 p_3
3 NA 97.2438356164384 p12
4 0 88.2027397260274 p12
5 1 77.5890410958904 p12
6 1 43.9123287671233 p12
7 1 63.5260273972603 p_3
8 1 42.1890410958904 p12
9 0 52.1753424657534 p12
10 1 65.6493150684932 p12
11 0 44.9835616438356 p12
12 1 64.8849315068493 p12
13 1 57.5835616438356 p12
14 0 47.3013698630137 p12
15 0 74.0356164383562 NA
16 0 65.4986301369863 p12
17 1 83.986301369863 p12
18 0 47.4904109589041 p12
19 1 47.8630136986301 p12
20 1 58.8520547945205 p12
21 1 61.3342465753425 p12
22 1 66.841095890411 p12
23 1 55.6383561643836 p12
24 1 52.7178082191781 p12
25 1 71.4630136986301 p12
26 1 NA p12
27 1 59.2082191780822 p12
28 1 69.8575342465753 p12
29 1 46.7397260273973 p12
30 1 53.5013698630137 p_3
31 1 41.3205479452055 p12
32 0 51.3917808219178 p_3
33 1 47.8684931506849 p12
34 1 87.654794520548 p12
35 0 75.558904109589 p12
36 1 71.2520547945205 p12
37 1 44.9808219178082 p_3
38 1 52 p_3
39 1 54.7643835616438 p_3
40 1 85.6630136986301 p_3
41 1 74.1835616438356 p_3
42 1 56.8684931506849 p_3
43 1 87.572602739726 p_3
44 1 85.0109589041096 p_3
45 1 70.0767123287671 p_3
46 0 47.2328767123288 p12
47 1 63.7972602739726 p12
48 1 85.8054794520548 p12
49 1 67.027397260274 p12
50 1 60.7342465753425 p12
51 0 61.9479452054794 p_3
52 0 86.8712328767123 p12
53 1 87.8219178082192 p12
54 1 49.9424657534247 p12
55 0 83.386301369863 p12
56 1 88.3013698630137 p12
57 1 55.7890410958904 p12
58 1 63.7616438356164 p12
59 1 55.5041095890411 p12
60 1 43.5232876712329 p12
61 1 58.8246575342466 p12
62 0 46.7397260273973 p12
63 0 74.2027397260274 p_3
64 0 51.9205479452055 p_3
65 0 78.1890410958904 p_3
66 0 78.9917808219178 p_3
Upvotes: 0
Views: 2158
Reputation: 58875
plyr
does treat missing as a separate group for the purposes of grouping variables. Therefore, in your example, there are three groups: 0
, 1
, and NA
. You can see which of the groups is throwing an error by adding the .inform
argument:
dlply(data.c,
.(strata),
function(x) t.test(age~treatment, data=x, na.rm=TRUE),
.inform=TRUE)
which reports:
Error in t.test.formula(age ~ treatment, data = x, na.rm = TRUE) :
grouping factor must have exactly 2 levels
Error: with piece 3:
strata age treatment
3 NA 97.24384 p12
And in fact there is just one row with strata
of NA
, and doing a t.test
on a single data point (or rather, where all the data points are in the same group) will give the error you see:
> t.test(age~treatment, data=data.c[is.na(data.c$strata),], na.rm=TRUE)
Error in t.test.formula(age ~ treatment, data = data.c[is.na(data.c$strata), :
grouping factor must have exactly 2 levels
To get around this problem, you can either subset data.c
in the dlply
call:
dlply(data.c[!is.na(data.c$strata),],
.(strata),
function(x) t.test(age~treatment, data=x, na.rm=TRUE))
(which assumes that you know where the problem will be to work around it) and which gives:
$`0`
Welch Two Sample t-test
data: age by treatment
t = 0.2653, df = 15.729, p-value = 0.7942
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-13.41565 17.24792
sample estimates:
mean in group p_3 mean in group p12
64.22896 62.31283
$`1`
Welch Two Sample t-test
data: age by treatment
t = 0.6606, df = 18.612, p-value = 0.517
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-6.998471 13.440397
sample estimates:
mean in group p_3 mean in group p12
65.50091 62.27995
attr(,"split_type")
[1] "data.frame"
attr(,"split_labels")
strata
1 0
2 1
or you can protect the call with error catching. plyr
provides a convenience wrapper try_default
for such a purpose.
dlply(data.c,
.(strata),
function(x) try_default(t.test(age~treatment, data=x, na.rm=TRUE),
NULL))
which gives
Error in t.test.formula(age ~ treatment, data = x, na.rm = TRUE) :
grouping factor must have exactly 2 levels
$`0`
Welch Two Sample t-test
data: age by treatment
t = 0.2653, df = 15.729, p-value = 0.7942
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-13.41565 17.24792
sample estimates:
mean in group p_3 mean in group p12
64.22896 62.31283
$`1`
Welch Two Sample t-test
data: age by treatment
t = 0.6606, df = 18.612, p-value = 0.517
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-6.998471 13.440397
sample estimates:
mean in group p_3 mean in group p12
65.50091 62.27995
$`NA`
NULL
attr(,"split_type")
[1] "data.frame"
attr(,"split_labels")
strata
1 0
2 1
3 NA
Note that there are 3 elements to this output including one for NA
which is set to NULL
due to the try_default
.
Upvotes: 1
Reputation: 69251
I think you will need to provide us some sample data as I'm not able to duplicate your error. I do get an error about ddply()
complaining that the following:
Error in list_to_dataframe(res, attr(.data, "split_labels")) :
Results must be all atomic, or all data frames
This is because the output from t.test()
is not a data.frame but a list of class htest
. So simply telling plyr to return a list object instead fixes the problem. If you want to extract a specific value from the list, you can probably get a data.frame back.
Here's how I set up your data:
x <- data.frame(strata = sample(letters[1:2], 100, TRUE),
age = sample(18:65, 100, TRUE),
treatment = sample(LETTERS[1:2], 100, TRUE))
x[sample(nrow(x), 10, FALSE), "strata"] <- NA
x[sample(nrow(x), 10, FALSE), "treatment"] <- NA
Then simply calling function dlply()
instead works:
out <- dlply(x, .(strata), function(x) t.test(age~treatment, data = x, na.rm=TRUE))
And to answer your question, yes - plyr seems to treat NA as a valid group, as evidenced by the length and names out the out
list object:
> names(out)
[1] "a" "b" "NA"
As I said, a better description of you problem may result in a better answer - but hopefully this gets you on the right path.
Upvotes: 2