Randomness in dplyr

Question

Something strange happened with dplyr today. I have 'data', a character matrix with 4 columns. It's a social network: V1 and V2 are nodes connected by an edge, and V3 and V4 are labels. I wanted summary statistics for this data set, so I used dplyr. However, running exactly the same pipeline several times gives different results each time. I don't see any source of randomness in grouping, arranging and summarizing the data. Could you tell me what might be going on in the example below?

Thanks!

library(dplyr)
library(magrittr)

> head(data)
     V1      V2       V3             V4 
[1,] "B1003" "B1051"  "130000037751" "B"
[2,] "B1009" "B1054"  "130000037751" "B"
[3,] "B1009" "B1033"  "130000037751" "B"
[4,] "B1012" "B1036"  "130000037751" "B"
[5,] "B1012" "B1066"  "130000037751" "B"
[6,] "B1012" "6IIIBM" "130000037751" "B"

> data %>%
+   as.data.frame %>%
+   group_by("V3", "V4") %>%
+   summarise(count=n_distinct("V1")) %>%
+   arrange(., desc(count)) %>%
+   print
Source: local data frame [293 x 3]
Groups: V3

             V3 V4 count
1  130000034371  A   179
2  130000014127  D   122
3  130000018500  A   112
4  130000028544  A   112
5  130000034057  E   108
6  130000061048  D   103
7  130000061048  A   100
8  130000042055  A    99
9  130000001997  D    98
10 130000042055  B    94

...

> data %>%
+   as.data.frame %>%
+   group_by("V3", "V4") %>%
+   summarise(count=n_distinct("V1")) %>%
+   arrange(., desc(count)) %>%
+   print
Source: local data frame [293 x 3]
Groups: V3

             V3 V4 count
1  130000035777  B   129
2  130000064171  C   118
3  130000001997  D   110
4  130000034057  E   109
5  130000012718  G    95
6  130000017725  B    92
7  130000047614  B    89
8  130000005741  C    86
9  130000034037  C    78
10 130000028189  A    77

...

> data %>%
+   as.data.frame %>%
+   group_by("V3", "V4") %>%
+   summarise(count=n_distinct("V1")) %>%
+   arrange(., desc(count)) %>%
+   print
Source: local data frame [293 x 3]
Groups: V3

             V3 V4 count
1  130000034371  A   162
2  130000036173  A   134
3  130000060230  E   114
4  130000060230  B   105
5  130000061592  C    99
6  130000001997  D    98
7  130000057531  B    95
8  130000028447  F    85
9  130000064171  C    85
10 130000057531  A    83
..          ... ..   ...

Patrick Roocks · Accepted Answer

You can reproduce similarly strange behavior by typing

summarise(mtcars, n_distinct("mpg"))

Repeated runs returned values between 16 and 24.

But this usage does not match the examples in the dplyr documentation. The arguments of these functions should be vectors (bare column names), not character strings.

The correct variant

 summarise(mtcars, n_distinct(mpg))

always returns the correct value, 25.
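
As a rough cross-check, the bare-name call can be compared with a base R count that does not go through dplyr's evaluation at all. This is only a sketch: the variable names by_column and base_count are made up for illustration. Note that a quoted "mpg" is a length-one character vector rather than the column, so a dplyr version without the behavior described above should see just one distinct value there, not the column's count.

```r
library(dplyr)

# Bare column name: counts the distinct values in the mpg column.
by_column <- summarise(mtcars, n = n_distinct(mpg))

# Base R cross-check that does not depend on dplyr at all.
base_count <- length(unique(mtcars$mpg))

# Both should agree on the number of distinct mpg values.
print(by_column$n)
print(base_count)
```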

So, try

data %>%
  as.data.frame %>%
  group_by(V3, V4) %>%
  summarise(count=n_distinct(V1)) %>%
  arrange(., desc(count)) %>%
  print

with your data. This will probably return the correct values.
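
For a quick sanity check, here is a minimal sketch that rebuilds only the six rows shown in the question's head(data) output and runs the corrected pipeline on them. It assumes a dplyr version that supports the .groups argument of summarise (1.0 or later); the name result is made up for illustration. With unquoted column names the output should be identical on every run.

```r
library(dplyr)

# The six rows printed by head(data) in the question, as a character matrix.
data <- matrix(
  c("B1003", "B1051",  "130000037751", "B",
    "B1009", "B1054",  "130000037751", "B",
    "B1009", "B1033",  "130000037751", "B",
    "B1012", "B1036",  "130000037751", "B",
    "B1012", "B1066",  "130000037751", "B",
    "B1012", "6IIIBM", "130000037751", "B"),
  ncol = 4, byrow = TRUE,
  dimnames = list(NULL, c("V1", "V2", "V3", "V4"))
)

# Unquoted column names refer to the columns themselves.
result <- data %>%
  as.data.frame() %>%
  group_by(V3, V4) %>%
  summarise(count = n_distinct(V1), .groups = "drop") %>%
  arrange(desc(count))

print(result)
```

These six rows form a single (V3, V4) group containing three distinct V1 nodes (B1003, B1009, B1012), so the result is easy to verify by hand.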

In any case, a warning from dplyr would be helpful when character strings are passed instead of column names.
