Luuk van Gasteren
Luuk van Gasteren

Reputation: 111

Find the most occuring column in data frame

I want to find the column of a data frame that occurres the most in a certain data frame. For example, the following data frame:

as.data.frame(cbind(c(1,4,6,9,20),c(2,4,7,7,3),c(4,7,6,4,2),c(1,4,6,9,20),c(4,7,6,4,2),c(7,4,6,4,2)))

My first idea is to order the columns in the data frame so I get: as.data.frame(cbind(c(1,4,6,9,20),c(2,3,4,7,7),c(2,4,4,6,7),c(1,4,6,9,20),c(2,4,4,6,7),c(2,4,4,6,7))) and then finding the most occurring column in this data frame. It then should return c(2,4,4,6,7). How can this by done in R?

Upvotes: 1

Views: 271

Answers (3)

Andrew
Andrew

Reputation: 5138

In case you want to match multiple columns:

# Creating a table of pasted & sorted column values
counts_df1 <- table(do.call(paste, data.frame(t(sapply(df1, sort)))))

# If you want the sorted order returned as a single element vector:
names(counts_df1[counts_df1 == max(counts_df1)])
[1] "2 4 4 6 7"

Or, you could do this to index the column in your dataframe:

# Creating collapsed strings from columns
df1_vec <- sapply(df1, function(x) paste0(sort(x), collapse = ""))

# Counting the frequency of each collapsed strings
df1_colsum <- colSums(outer(df1_vec, df1_vec, `==`))

# Subsetting the dataframe based on the most frequent columns that are not duplicates
df1[, df1_colsum == max(df1_colsum) & !duplicated(df1_vec)]
[1] 4 7 6 4 2

Upvotes: 1

yarnabrina
yarnabrina

Reputation: 1656

Basically same solution as of @jay.sf, but using tidyverse:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

dataset <- as.data.frame(x = cbind(c(1, 4, 6, 9, 20),
                                   c(2, 4, 7, 7, 3),
                                   c(4, 7, 6, 4, 2),
                                   c(1, 4, 6, 9, 20),
                                   c(4, 7, 6, 4, 2),
                                   c(7, 4, 6, 4, 2)))

dataset %>%
  summarise_all(.funs = ~ paste0(sort(.), collapse = "")) %>%
  as.numeric() %>%
  table() %>%
  which.max() %>%
  `[`(dataset, .)
#>   V2
#> 1  2
#> 2  4
#> 3  7
#> 4  7
#> 5  3

Created on 2019-06-15 by the reprex package (v0.3.0)

Upvotes: 2

jay.sf
jay.sf

Reputation: 72593

Basically you could paste the numbers and count them in a table, then choose which.max.

d1[, which.max(table(sapply(d1, paste, collapse="")))]
# [1] 2 4 4 6 7

Data

d1 <- structure(list(X1 = c(1, 4, 6, 9, 20), X2 = c(2, 3, 4, 7, 7), 
X3 = c(2, 4, 4, 6, 7), X4 = c(1, 4, 6, 9, 20), X5 = c(2, 
4, 4, 6, 7), X6 = c(2, 4, 4, 6, 7)), class = "data.frame", row.names = c(NA, -5L))

Upvotes: 2

Related Questions