Reputation: 185

Retrieve the most repeated (x, y) values in two columns in a data frame

I am storing (x, y) values in a dataframe. I want to return the most frequently appearing (x, y) combination.

Here is an example:

> x = c(1, 1, 2, 3, 4, 5, 6)
> y = c(1, 1, 5, 6, 9, 10, 12)
> xy = data.frame(x, y)
> xy
  x  y
1 1  1
2 1  1
3 2  5
4 3  6
5 4  9
6 5 10
7 6 12

The most common (x, y) value would be (1, 1).

I tried the answer here for a single column. It works for a single column, but does not work for an aggregate of two columns.

> tail(names(sort(table(xy$x))), 1)
[1] "1"
> tail(names(sort(table(xy$x, xy$y))), 1)
NULL

How do I retrieve the most repeated (x, y) values in two columns in a data frame in R?

EDIT: c(1, 2) should be considered distinct from c(2, 1).

Upvotes: 12

Answers (10)

mpalanco

Reputation: 13570

Using sqldf:

library(sqldf)    
sqldf('SELECT x, y 
          FROM xy 
          GROUP BY (x||y) 
          ORDER BY COUNT(*) DESC 
          LIMIT 1')
  x y
1 1 1

If we'd like to show a frequency column, and not just one row (in case there are any ties):

x = c(1, 1, 2, 3, 4, 12, 12)
y = c(1, 1, 5, 6, 9, 12, 12)
xy = data.frame(x, y)

sqldf('SELECT x, y, COUNT(*) AS freq
      FROM xy 
      GROUP BY (x||y) 
      ORDER BY COUNT(*) DESC')

   x  y freq
1  1  1    2
2 12 12    2
3  2  5    1
4  3  6    1
5  4  9    1

Upvotes: 1

Carl Witthoft

Reputation: 21492

Late to the party, but here's a time test:

x<-sample(1:10,1e5,rep=TRUE)
y<-sample(1:10,1e5,rep=TRUE)


martin  <- function(x, y) {
    x = match(x, unique(x)); y = match(y, unique(y))
    v = x + (max(x) - 1L) * y
    which.max(tabulate(match(v, v)))
}
akrun <-function(x,y) {
    library(data.table)
    xy<-data.frame(x,y)
setDT(xy)[, .N,list(x,y)][which.max(N)]
}
mucio <-function(x,y){
    xy<-data.frame(x,y)
    t<-table(xy)
indexes <- which(t == max(t), arr.ind = TRUE)[1,]
x_value <- dimnames(t)$x[indexes["x"]]
y_value <- dimnames(t)$y[indexes["y"]]
rep_number <- max(t)

}

sam<-function(x,y){
    library(dplyr)
    xy<-data.frame(x,y)
xy %>%
  group_by(x, y) %>%
  tally() %>%
  ungroup %>%
  top_n(1)

}
dimitris<-function(x,y){
    library(dplyr)
xy<-data.frame(x,y)
xy %>% group_by(x, y) %>% summarise(n=n()) %>% 
   ungroup %>% filter(n==max(n)) %>% select(-n)

}

microbenchmark(martin(x,y),akrun(x,y),mucio(x,y),sam(x,y),dimitris(x,y),times=5)

Unit: milliseconds
           expr       min        lq       mean    median         uq
   martin(x, y) 11.727217 14.246913  41.359218 14.384385  82.639796
    akrun(x, y)  4.426462  4.613420   4.866548  4.892432   5.011406
    mucio(x, y) 73.938586 74.037568 103.941459 79.516207 145.232870
      sam(x, y)  8.356426  8.586212   8.919787  8.586521   8.775792
 dimitris(x, y)  8.618394  8.738228   9.252105  9.063965   9.075298
        max neval cld
  83.797780     5  a 
   5.389018     5  a 
 146.982062     5   b
  10.293983     5  a 
  10.764640     5  a

Upvotes: 1

Martin Morgan

Reputation: 46856

(Despite all the plus votes, a hybrid of @DavidArenburg and my approaches

res = do.call("paste", c(xy, sep="\r"))
which.max(tabulate(match(res, res)))

might be simple and effective.)

Maybe it seems a little round-about, but a first step is to transform the possibly arbitrary values in the columns of xy to integers ranging from 1 to the number of unique values in the column

x = match(xy[[1]], unique(xy[[1]]))
y = match(xy[[2]], unique(xy[[2]]))

Then encode the combination of columns to unique values

v = x + (max(x) - 1L) * y

Indexing minimizes the range of values under consideration, and encoding reduces a two-dimensional problem to a single dimension. These steps reduce the space required of any tabulation (as with table() in other answers) to the minimum, without creating character vectors.

If one wanted to most common occurrence in a single dimension, then one could index and tabulate v

tbl = tabulate(match(v, v))

and find the index of the first occurrence of the maximum value(s), e.g.,

df[which.max(tbl),]

Here's a function to do the magic

whichpairmax <- function(x, y) {
    x = match(x, unique(x)); y = match(y, unique(y))
    v = x + (max(x) - 1L) * y
    which.max(tabulate(match(v, v)))
}

and a couple of tests

> set.seed(123)
> xy[whichpairmax(xy[[1]], xy[[2]]),]
  x y
1 1 1
> xy1 = xy[sample(nrow(xy)),]
> xy1[whichpairmax(xy1[[1]], xy1[[2]]),]
  x y
1 1 1
> xy1
  x  y
3 2  5
5 4  9
7 6 12
4 3  6
6 5 10
1 1  1
2 1  1

For an arbitrary data.frame

whichdfmax <- function(df) {
    v = integer(nrow(df))
    for (col in df) {
        col = match(col, unique(col))
        v = col + (max(col) - 1L) * match(v, unique(v))
    }
    which.max(tabulate(match(v, v)))
}

Upvotes: 9

mucio

Reputation: 7119

t<-table(xy)
which(t == max(t), arr.ind = TRUE)

Update:

As pointed out by David Arenburg, the initial code returned just the index of the values from the table(xy) function. If you need the values and maybe the number of occurrences of the max couple you can try this:

t<-table(xy)
indexes <- which(t == max(t), arr.ind = TRUE)[1,]
x_value <- dimnames(t)$x[indexes["x"]]
y_value <- dimnames(t)$y[indexes["y"]]
rep_number <- max(t)

Now I suspect there is better way to write the last three lines of code, but I'm still new to the R world

Upvotes: 4

David Arenburg

Reputation: 92282

Not sure how will the desired output should look like, but here's a possible solution

res <- table(do.call(paste, xy))
res[which.max(res)]
# 1 1 
#   2

In order to get the actual values, one could do

res <- do.call(paste, xy) 
xy[which.max(ave(seq(res), res, FUN = length)), ]
#   x y
# 1 1 1

Upvotes: 11

dimitris_ps

Reputation: 5951

With dplyr

library(dplyr)

xy %>% group_by(x, y) %>% summarise(n=n()) %>% 
   ungroup %>% filter(n==max(n)) %>% select(-n)

Upvotes: 1

Sam Firke

Reputation: 23004

library(dplyr)
xy %>%
  group_by(x, y) %>%
  tally() %>%
  ungroup %>%
  top_n(1)

Upvotes: 1

mlegge

Reputation: 6913

library(data.table)
DT <- data.table(xy)
tail(DT[, Count := .N, by = c("x", "y")][ order(Count) ], 1)
    x y Count
 1: 1 1     2

Upvotes: 3

akrun

Reputation: 886938

Try

library(data.table)
setDT(xy)[, .N,list(x,y)][which.max(N)]
#   x y N
#1: 1 1 2

Upvotes: 7

Jaehyeon Kim

Reputation: 1417

What about this?

x = c(1, 1, 2, 3, 4, 5, 6)
y = c(1, 1, 5, 6, 9, 10, 12)
xy = data.frame(x, y)

table(xy)
y
x   1 5 6 9 10 12
1 2 0 0 0  0  0
2 0 1 0 0  0  0
3 0 0 1 0  0  0
4 0 0 0 1  0  0
5 0 0 0 0  1  0
6 0 0 0 0  0  1

Upvotes: 3

Retrieve the most repeated (x, y) values in two columns in a data frame

Answers (10)

Related Questions