Reputation: 185
I am storing (x, y)
values in a dataframe. I want to return the most frequently appearing (x, y)
combination.
Here is an example:
> x = c(1, 1, 2, 3, 4, 5, 6)
> y = c(1, 1, 5, 6, 9, 10, 12)
> xy = data.frame(x, y)
> xy
x y
1 1 1
2 1 1
3 2 5
4 3 6
5 4 9
6 5 10
7 6 12
The most common (x, y)
value would be (1, 1)
.
I tried the answer here for a single column. It works for a single column, but does not work for an aggregate of two columns.
> tail(names(sort(table(xy$x))), 1)
[1] "1"
> tail(names(sort(table(xy$x, xy$y))), 1)
NULL
How do I retrieve the most repeated (x, y) values in two columns in a data frame in R?
EDIT: c(1, 2)
should be considered distinct from c(2, 1)
.
Upvotes: 12
Views: 1605
Reputation: 13570
Using sqldf
:
library(sqldf)
sqldf('SELECT x, y
FROM xy
GROUP BY (x||y)
ORDER BY COUNT(*) DESC
LIMIT 1')
x y
1 1 1
If we'd like to show a frequency column, and not just one row (in case there are any ties):
x = c(1, 1, 2, 3, 4, 12, 12)
y = c(1, 1, 5, 6, 9, 12, 12)
xy = data.frame(x, y)
sqldf('SELECT x, y, COUNT(*) AS freq
FROM xy
GROUP BY (x||y)
ORDER BY COUNT(*) DESC')
x y freq
1 1 1 2
2 12 12 2
3 2 5 1
4 3 6 1
5 4 9 1
Upvotes: 1
Reputation: 21492
Late to the party, but here's a time test:
x<-sample(1:10,1e5,rep=TRUE)
y<-sample(1:10,1e5,rep=TRUE)
martin <- function(x, y) {
x = match(x, unique(x)); y = match(y, unique(y))
v = x + (max(x) - 1L) * y
which.max(tabulate(match(v, v)))
}
akrun <-function(x,y) {
library(data.table)
xy<-data.frame(x,y)
setDT(xy)[, .N,list(x,y)][which.max(N)]
}
mucio <-function(x,y){
xy<-data.frame(x,y)
t<-table(xy)
indexes <- which(t == max(t), arr.ind = TRUE)[1,]
x_value <- dimnames(t)$x[indexes["x"]]
y_value <- dimnames(t)$y[indexes["y"]]
rep_number <- max(t)
}
sam<-function(x,y){
library(dplyr)
xy<-data.frame(x,y)
xy %>%
group_by(x, y) %>%
tally() %>%
ungroup %>%
top_n(1)
}
dimitris<-function(x,y){
library(dplyr)
xy<-data.frame(x,y)
xy %>% group_by(x, y) %>% summarise(n=n()) %>%
ungroup %>% filter(n==max(n)) %>% select(-n)
}
microbenchmark(martin(x,y),akrun(x,y),mucio(x,y),sam(x,y),dimitris(x,y),times=5)
Unit: milliseconds
expr min lq mean median uq
martin(x, y) 11.727217 14.246913 41.359218 14.384385 82.639796
akrun(x, y) 4.426462 4.613420 4.866548 4.892432 5.011406
mucio(x, y) 73.938586 74.037568 103.941459 79.516207 145.232870
sam(x, y) 8.356426 8.586212 8.919787 8.586521 8.775792
dimitris(x, y) 8.618394 8.738228 9.252105 9.063965 9.075298
max neval cld
83.797780 5 a
5.389018 5 a
146.982062 5 b
10.293983 5 a
10.764640 5 a
Upvotes: 1
Reputation: 46856
(Despite all the plus votes, a hybrid of @DavidArenburg and my approaches
res = do.call("paste", c(xy, sep="\r"))
which.max(tabulate(match(res, res)))
might be simple and effective.)
Maybe it seems a little round-about, but a first step is to transform the possibly arbitrary values in the columns of xy
to integers ranging from 1 to the number of unique values in the column
x = match(xy[[1]], unique(xy[[1]]))
y = match(xy[[2]], unique(xy[[2]]))
Then encode the combination of columns to unique values
v = x + (max(x) - 1L) * y
Indexing minimizes the range of values under consideration, and encoding reduces a two-dimensional problem to a single dimension. These steps reduce the space required of any tabulation (as with table()
in other answers) to the minimum, without creating character vectors.
If one wanted to most common occurrence in a single dimension, then one could index and tabulate v
tbl = tabulate(match(v, v))
and find the index of the first occurrence of the maximum value(s), e.g.,
df[which.max(tbl),]
Here's a function to do the magic
whichpairmax <- function(x, y) {
x = match(x, unique(x)); y = match(y, unique(y))
v = x + (max(x) - 1L) * y
which.max(tabulate(match(v, v)))
}
and a couple of tests
> set.seed(123)
> xy[whichpairmax(xy[[1]], xy[[2]]),]
x y
1 1 1
> xy1 = xy[sample(nrow(xy)),]
> xy1[whichpairmax(xy1[[1]], xy1[[2]]),]
x y
1 1 1
> xy1
x y
3 2 5
5 4 9
7 6 12
4 3 6
6 5 10
1 1 1
2 1 1
For an arbitrary data.frame
whichdfmax <- function(df) {
v = integer(nrow(df))
for (col in df) {
col = match(col, unique(col))
v = col + (max(col) - 1L) * match(v, unique(v))
}
which.max(tabulate(match(v, v)))
}
Upvotes: 9
Reputation: 7119
t<-table(xy)
which(t == max(t), arr.ind = TRUE)
Update:
As pointed out by David Arenburg, the initial code returned just the index of the values from the table(xy)
function. If you need the values and maybe the number of occurrences of the max couple you can try this:
t<-table(xy)
indexes <- which(t == max(t), arr.ind = TRUE)[1,]
x_value <- dimnames(t)$x[indexes["x"]]
y_value <- dimnames(t)$y[indexes["y"]]
rep_number <- max(t)
Now I suspect there is better way to write the last three lines of code, but I'm still new to the R world
Upvotes: 4
Reputation: 92282
Not sure how will the desired output should look like, but here's a possible solution
res <- table(do.call(paste, xy))
res[which.max(res)]
# 1 1
# 2
In order to get the actual values, one could do
res <- do.call(paste, xy)
xy[which.max(ave(seq(res), res, FUN = length)), ]
# x y
# 1 1 1
Upvotes: 11
Reputation: 5951
With dplyr
library(dplyr)
xy %>% group_by(x, y) %>% summarise(n=n()) %>%
ungroup %>% filter(n==max(n)) %>% select(-n)
Upvotes: 1
Reputation: 23004
library(dplyr)
xy %>%
group_by(x, y) %>%
tally() %>%
ungroup %>%
top_n(1)
Upvotes: 1
Reputation: 6913
library(data.table)
DT <- data.table(xy)
tail(DT[, Count := .N, by = c("x", "y")][ order(Count) ], 1)
x y Count
1: 1 1 2
Upvotes: 3
Reputation: 886938
Try
library(data.table)
setDT(xy)[, .N,list(x,y)][which.max(N)]
# x y N
#1: 1 1 2
Upvotes: 7
Reputation: 1417
What about this?
x = c(1, 1, 2, 3, 4, 5, 6)
y = c(1, 1, 5, 6, 9, 10, 12)
xy = data.frame(x, y)
table(xy)
y
x 1 5 6 9 10 12
1 2 0 0 0 0 0
2 0 1 0 0 0 0
3 0 0 1 0 0 0
4 0 0 0 1 0 0
5 0 0 0 0 1 0
6 0 0 0 0 0 1
Upvotes: 3