Tim

Reputation: 7464

Randomness in dplyr

Something strange happened with dplyr today. I have 'data', a matrix with 4 columns. It's a social network: V1 and V2 are nodes connected by an edge, and V3 and V4 are labels. I wanted summary statistics for this data set, so I used dplyr. However, it gives me what look like random results: the same pipeline returns different counts each time I run it. I don't see any source of randomness in grouping, summarizing and arranging the data. Could you tell me what might have happened in the example below?

Thanks!

library(dplyr)
library(magrittr)

> head(data)
     V1      V2       V3             V4 
[1,] "B1003" "B1051"  "130000037751" "B"
[2,] "B1009" "B1054"  "130000037751" "B"
[3,] "B1009" "B1033"  "130000037751" "B"
[4,] "B1012" "B1036"  "130000037751" "B"
[5,] "B1012" "B1066"  "130000037751" "B"
[6,] "B1012" "6IIIBM" "130000037751" "B"

> data %>%
+   as.data.frame %>%
+   group_by("V3", "V4") %>%
+   summarise(count=n_distinct("V1")) %>%
+   arrange(., desc(count)) %>%
+   print
Source: local data frame [293 x 3]
Groups: V3

             V3 V4 count
1  130000034371  A   179
2  130000014127  D   122
3  130000018500  A   112
4  130000028544  A   112
5  130000034057  E   108
6  130000061048  D   103
7  130000061048  A   100
8  130000042055  A    99
9  130000001997  D    98
10 130000042055  B    94

...

> data %>%
+   as.data.frame %>%
+   group_by("V3", "V4") %>%
+   summarise(count=n_distinct("V1")) %>%
+   arrange(., desc(count)) %>%
+   print
Source: local data frame [293 x 3]
Groups: V3

             V3 V4 count
1  130000035777  B   129
2  130000064171  C   118
3  130000001997  D   110
4  130000034057  E   109
5  130000012718  G    95
6  130000017725  B    92
7  130000047614  B    89
8  130000005741  C    86
9  130000034037  C    78
10 130000028189  A    77

...

> data %>%
+   as.data.frame %>%
+   group_by("V3", "V4") %>%
+   summarise(count=n_distinct("V1")) %>%
+   arrange(., desc(count)) %>%
+   print
Source: local data frame [293 x 3]
Groups: V3

             V3 V4 count
1  130000034371  A   162
2  130000036173  A   134
3  130000060230  E   114
4  130000060230  B   105
5  130000061592  C    99
6  130000001997  D    98
7  130000057531  B    95
8  130000028447  F    85
9  130000064171  C    85
10 130000057531  A    83
..          ... ..   ...

Upvotes: 2

Views: 406

Answers (1)

Patrick Roocks

Reputation: 3259

You can reproduce similarly strange behavior by typing

summarise(mtcars, n_distinct("mpg"))

Iterated runs returned values between 16 and 24.

But this does not match the examples in the dplyr documentation. The arguments to these functions should be unquoted column names (which refer to the column vectors), not character strings.

The correct variant

 summarise(mtcars, n_distinct(mpg))

always returns the correct value "25".
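
As a quick sanity check outside dplyr (a base-R sketch, not an explanation of why the quoted version fluctuates): a quoted name is just a length-one character vector, so it contains exactly one distinct value, while the column itself contains 25.

```r
# A quoted name is a constant string, unrelated to the column of that name.
n_string <- length(unique("mpg"))        # distinct values in the constant "mpg"
n_column <- length(unique(mtcars$mpg))   # distinct values in the mpg column

print(n_string)
print(n_column)
```

So neither 16 nor 24 is a meaningful answer for the quoted call; only the unquoted form counts the values in the column.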

So, try

data %>%
  as.data.frame %>%
  group_by(V3, V4) %>%
  summarise(count=n_distinct(V1)) %>%
  arrange(., desc(count)) %>%
  print

with your data. This should probably return the correct values.

But anyway, a warning from dplyr would be nice when characters are used.
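
If you want to cross-check the counts independently of dplyr, here is a minimal base-R sketch. The `toy` matrix is hypothetical (a few made-up rows shaped like the question's data), and `aggregate` is swapped in for `group_by`/`summarise`; it is only meant as a way to confirm that the per-group distinct counts are stable.

```r
# Hypothetical toy edge list with the same four columns as the question's matrix.
toy <- matrix(
  c("B1003", "B1051", "130000037751", "B",
    "B1009", "B1054", "130000037751", "B",
    "B1009", "B1033", "130000037751", "B",
    "B1012", "B1036", "130000034371", "A",
    "B1012", "B1066", "130000034371", "A"),
  ncol = 4, byrow = TRUE,
  dimnames = list(NULL, c("V1", "V2", "V3", "V4"))
)

df <- as.data.frame(toy, stringsAsFactors = FALSE)

# Count distinct V1 values within each (V3, V4) group using base R only.
counts <- aggregate(df["V1"], by = df[c("V3", "V4")],
                    FUN = function(x) length(unique(x)))
names(counts)[names(counts) == "V1"] <- "count"

# Sort by descending count, mirroring arrange(desc(count)) in the pipeline.
counts <- counts[order(-counts$count), ]
print(counts)
```

Running this repeatedly gives the same table each time, which is what the corrected dplyr pipeline should do as well.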

Upvotes: 4
