user11368874
user11368874

Reputation: 31

How to identify names of columns having duplicate values in R

The dataframe has columns which consists of duplicate values. How do we identify the colnames which have duplicate value and their count.

data set :

A  B  C
1  2  a
2  3  b
3  4  a
1  5  c

I need output like

columns having duplicate values are A(2) and C(2)

I have tried duplicated(), it returns a vector.

Upvotes: 2

Views: 435

Answers (3)

akrun
akrun

Reputation: 887028

We can loop over the columns and find any duplicates with table to identify the column names

names(df1)[sapply(df1, function(x) any(table(x) > 1))]
#[1] "A" "C"

Or another base R solution would be

names(Filter(anyDuplicated, df1))
#[1] "A" "C"

Or with tidyverse

library(tidyverse)
names(df1)[!!map_int(df1, anyDuplicated)]
#[1] "A" "C"

data

df1 <- structure(list(A = c(1L, 2L, 3L, 1L), B = 2:5, C = c("a", "b", 
   "a", "c")), class = "data.frame", row.names = c(NA, -4L))

Upvotes: 1

jogo
jogo

Reputation: 12559

To test if there are duplicate values in a column one can compare the number of unique values with the number of values in the column:

D <- read.table(header=TRUE, stringsAsFactors = FALSE, text=
"A  B  C
1  2  a
2  3  b
3  4  a
1  5  c")
names(D)[sapply(D, function(x) length(unique(x)))!=nrow(D)]

or

names(D)[sapply(D, function(x) any(duplicated(x)))]

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 388862

We could use sapply to loop column wise find the duplicated elements and take sum of it.

colSums(sapply(df, function(x) duplicated(x)|duplicated(x, fromLast = TRUE)))

#A B C 
#2 0 2 

If you need only non-zero column you could do

vals <- colSums(sapply(df, function(x) 
                duplicated(x)|duplicated(x, fromLast = TRUE)))
vals[vals!= 0]
#A C 
#2 2 

Using dplyr we can do the same with summarise_all

library(dplyr)
df %>%
  summarise_all(~sum(duplicated(.) | duplicated(., fromLast = TRUE))) %>%
  select_if(~. != 0)

#  A C
#1 2 2

Upvotes: 3

Related Questions