Rich
Rich

Reputation: 141

How to count the different values by comparing two columns in R?

I want to count the values by comparing two columns of the dataframe in R.

For example:

col1  col2
A      A
A      A
A      B
G      G
G      H
Y      Y
Y      Y
J      P
J      P
J      J
K      L

I wish to get an output which shows the count of match (if two columns have same values) and the count of not match (if two columns have different value) and display the percentage of match and not match in the next columns

col1   count_match  count_notmatch   percent_match   percent_notmatch
A       2           1                66.66%          33.33%
G       1           1                50.00%          50.00%
Y       2           0                100.00%         0
J       1           2                33.33%          66.66%
K       0           1                0               100%

How do I achieve this? Thanks for any help.

Upvotes: 0

Views: 777

Answers (2)

Dan Adams
Dan Adams

Reputation: 5204

You can build the summary table in a few steps:

library(tidyverse)
library(scales)

d <- structure(list(col1 = c("A", "A", "A", "G", "G", "Y", "Y", "J", "J", "J", "K"), 
                    col2 = c("A", "A", "B", "G", "H", "Y", "Y", "P", "P", "J", "L")), class = "data.frame", row.names = c(NA, -11L))


d %>% 
  mutate(match = col1 == col2, 
         nomatch = !match) %>% 
  group_by(col1) %>% 
  summarise(count_match = sum(match),
            count_nomatch = sum(nomatch)) %>% 
  rowwise() %>% 
  mutate(percent_match = count_match/sum(count_match, count_nomatch), 
         percent_nomatch = 1 - percent_match) %>% 
  mutate(across(starts_with("percent"), ~percent(.x))) %>%
  ungroup()
#> # A tibble: 5 × 5
#>   col1  count_match count_nomatch percent_match percent_nomatch
#>   <chr>       <int>         <int> <chr>         <chr>          
#> 1 A               2             1 67%           33%            
#> 2 G               1             1 50%           50%            
#> 3 J               1             2 33%           67%            
#> 4 K               0             1 0%            100%           
#> 5 Y               2             0 100%          0%

Created on 2022-07-18 by the reprex package (v2.0.1)

Upvotes: 1

Darren Tsai
Darren Tsai

Reputation: 35554

You could group the data by col1 and summarise():

library(dplyr)

df %>%
  group_by(col1) %>%
  summarise(count_match = sum(col1 == col2),
            count_nomatch = n() - count_match,
            across(contains("match"), ~ .x / n() * 100, .names = "{sub('count', 'percent', .col)}"))

# # A tibble: 5 × 5
#   col1  count_match count_nomatch percent_match percent_nomatch
#   <chr>       <int>         <int>         <dbl>           <dbl>
# 1 A               2             1          66.7            33.3
# 2 G               1             1          50              50
# 3 J               1             2          33.3            66.7
# 4 K               0             1           0             100  
# 5 Y               2             0         100               0

Upvotes: 1

Related Questions