Reputation: 7464
Something strange happened with dplyr today. I have 'data', a matrix with 4 columns. It's a social network: V1 & V2 are nodes connected by an edge, and V3 & V4 are some labels. I wanted summary statistics for this data set, so I used dplyr. However, it gives me what look like random results: the same pipeline returns different counts on every run. I don't see any source of randomness in grouping, arranging and summarizing the data. Could you tell me what could have happened in the example attached?
Thanks!
library(dplyr)
library(magrittr)
> head(data)
V1 V2 V3 V4
[1,] "B1003" "B1051" "130000037751" "B"
[2,] "B1009" "B1054" "130000037751" "B"
[3,] "B1009" "B1033" "130000037751" "B"
[4,] "B1012" "B1036" "130000037751" "B"
[5,] "B1012" "B1066" "130000037751" "B"
[6,] "B1012" "6IIIBM" "130000037751" "B"
> data %>%
+ as.data.frame %>%
+ group_by("V3", "V4") %>%
+ summarise(count=n_distinct("V1")) %>%
+ arrange(., desc(count)) %>%
+ print
Source: local data frame [293 x 3]
Groups: V3
V3 V4 count
1 130000034371 A 179
2 130000014127 D 122
3 130000018500 A 112
4 130000028544 A 112
5 130000034057 E 108
6 130000061048 D 103
7 130000061048 A 100
8 130000042055 A 99
9 130000001997 D 98
10 130000042055 B 94
...
> data %>%
+ as.data.frame %>%
+ group_by("V3", "V4") %>%
+ summarise(count=n_distinct("V1")) %>%
+ arrange(., desc(count)) %>%
+ print
Source: local data frame [293 x 3]
Groups: V3
V3 V4 count
1 130000035777 B 129
2 130000064171 C 118
3 130000001997 D 110
4 130000034057 E 109
5 130000012718 G 95
6 130000017725 B 92
7 130000047614 B 89
8 130000005741 C 86
9 130000034037 C 78
10 130000028189 A 77
...
> data %>%
+ as.data.frame %>%
+ group_by("V3", "V4") %>%
+ summarise(count=n_distinct("V1")) %>%
+ arrange(., desc(count)) %>%
+ print
Source: local data frame [293 x 3]
Groups: V3
V3 V4 count
1 130000034371 A 162
2 130000036173 A 134
3 130000060230 E 114
4 130000060230 B 105
5 130000061592 C 99
6 130000001997 D 98
7 130000057531 B 95
8 130000028447 F 85
9 130000064171 C 85
10 130000057531 A 83
.. ... .. ...
Upvotes: 2
Views: 406
Reputation: 3259
You can reproduce similarly strange behavior by typing
summarise(mtcars, n_distinct("mpg"))
Repeated runs returned values between 16 and 24.
This usage does not match the examples in the dplyr documentation, though: these functions expect unquoted column names, which dplyr evaluates as the column vectors, not character strings.
The correct variant
summarise(mtcars, n_distinct(mpg))
always returns the correct value, 25.
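To make the difference concrete, here is a minimal sketch using the built-in mtcars data. The variable names (`quoted`, `unquoted`, `base_count`) are only illustrative, and the comment on the quoted call describes what I would expect from recent dplyr releases, not the older version that produced the varying numbers above.

```r
library(dplyr)

# A quoted name is a length-one character constant, not a column.
# Assumption: in recent dplyr releases this is expected to evaluate to 1;
# the older version discussed above behaved differently.
quoted <- summarise(mtcars, n = n_distinct("mpg"))

# An unquoted name is looked up as a column of mtcars.
unquoted <- summarise(mtcars, n = n_distinct(mpg))

# Base R cross-check of the unquoted call.
base_count <- length(unique(mtcars$mpg))

print(unquoted$n)   # 25
print(base_count)   # 25
```

The base R cross-check is there only to confirm that the unquoted call really counts the distinct values of the column.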
So, try
data %>%
  as.data.frame %>%
  group_by(V3, V4) %>%
  summarise(count = n_distinct(V1)) %>%
  arrange(desc(count)) %>%
  print
with your data; this will probably return the correct values.
Either way, it would be nice if dplyr issued a warning when character strings are passed in place of column names.
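As a rough check of the unquoted pipeline, the sketch below runs it twice on a small made-up edge list shaped like the matrix in the question. The `toy` values and the helper `count_nodes` are hypothetical, and `.groups = "drop"` assumes dplyr 1.0 or later.

```r
library(dplyr)

# Hypothetical toy edge list with the same four columns (values made up).
toy <- matrix(
  c("B1003", "B1051", "1300", "B",
    "B1009", "B1054", "1300", "B",
    "B1009", "B1033", "1300", "B",
    "B1012", "B1036", "1400", "A",
    "B1012", "B1066", "1400", "A",
    "B1003", "B1066", "1400", "B"),
  ncol = 4, byrow = TRUE,
  dimnames = list(NULL, c("V1", "V2", "V3", "V4"))
)

# Count distinct V1 nodes per (V3, V4) group, using unquoted column names.
count_nodes <- function(m) {
  m %>%
    as.data.frame() %>%
    group_by(V3, V4) %>%
    summarise(count = n_distinct(V1), .groups = "drop") %>%
    arrange(desc(count))
}

first_run  <- count_nodes(toy)
second_run <- count_nodes(toy)

print(first_run)
print(isTRUE(all.equal(first_run, second_run)))  # TRUE
```

With unquoted names the two runs agree, and the largest group ("1300", "B") has 2 distinct V1 nodes.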
Upvotes: 4