nred
nred

Reputation: 25

Find column with most matches across multiple columns R

I have data that looks like this:

   id   a   b   c
1 346 BBX BBX RRS
2 677 BBX TFT TFT
3 999 BBX CAT CAT

What I want to do is: for each row, find the value across a,b,c that has the most matches. So the result would look like this:

   id   y
1 346 BBX
2 677 TFT
3 999 CAT

If there are cases where there is no column with more matches than any other, I am indifferent to which value is chosen.

Any suggestions?

I also got as far as transforming the data in the manner below, and creating a variable "n", for the number of occurrences of each x. But not sure how "select" the x with highest n for each id.

   id  x   n
1 346 BBX  2
2 346 RRS  1
3 677 BBX  1
4 677 TFT  2
5 999 BBX  1
6 999 CAT  2

Here's some code the data:

id <- c("346", "677", "999")
a <- rep ("BBX", 3)
b <- c("BBX", "TFT", "CAT")
c <- c("RRS", "TFT", "CAT")
x <- data.frame(id, a,b,c, stringsAsFactors=F)

or

id <- c("346", "346", "677", "677", "999", "999")
x <- c("BBX", "RRS", "BBX", "TFT", "BBX", "CAT")
n <- c("2", "1", "1", "2", "1", "2")
bl <- data.frame(id, x, n)

THANK YOU!

Upvotes: 1

Views: 122

Answers (1)

StupidWolf
StupidWolf

Reputation: 46888

If you have already pivot it longer, you can use top_n, like this:

x %>% pivot_longer(-id) %>% group_by(id) %>%  
count(value) %>% arrange(id,-n) %>% 
filter(!duplicated(id))
# A tibble: 3 x 3
# Groups:   id [3]
  id    value     n
  <chr> <chr> <int>
1 346   BBX       2
2 677   TFT       2
3 999   CAT       2

If you have NAs, for example, you can try to filter them out:

x[1,3:4]= NA
x %>% pivot_longer(-id) %>% 
group_by(id) %>% filter(!is.na(value)) %>% 
count(value) %>% arrange(id,-n) %>% 
filter(!duplicated(id))

Or in base R, you can apply a sort(table(..)) function row-wise:

most_freq = apply(x[,-1],1,function(i)names(sort(-table(i)))[1])
data.frame(id=x$id,most_freq)
   id most_freq
1 346       BBX
2 677       TFT
3 999       CAT

Upvotes: 1

Related Questions