Reputation: 31
The dataframe has columns which consists of duplicate values. How do we identify the colnames which have duplicate value and their count.
data set :
A B C
1 2 a
2 3 b
3 4 a
1 5 c
I need output like
columns having duplicate values are A(2) and C(2)
I have tried duplicated(), it returns a vector.
Upvotes: 2
Views: 435
Reputation: 887028
We can loop over the columns and find any
duplicates with table
to identify the column names
names(df1)[sapply(df1, function(x) any(table(x) > 1))]
#[1] "A" "C"
Or another base R
solution would be
names(Filter(anyDuplicated, df1))
#[1] "A" "C"
Or with tidyverse
library(tidyverse)
names(df1)[!!map_int(df1, anyDuplicated)]
#[1] "A" "C"
df1 <- structure(list(A = c(1L, 2L, 3L, 1L), B = 2:5, C = c("a", "b",
"a", "c")), class = "data.frame", row.names = c(NA, -4L))
Upvotes: 1
Reputation: 12559
To test if there are duplicate values in a column one can compare the number of unique values with the number of values in the column:
D <- read.table(header=TRUE, stringsAsFactors = FALSE, text=
"A B C
1 2 a
2 3 b
3 4 a
1 5 c")
names(D)[sapply(D, function(x) length(unique(x)))!=nrow(D)]
or
names(D)[sapply(D, function(x) any(duplicated(x)))]
Upvotes: 1
Reputation: 388862
We could use sapply
to loop column wise find the duplicated
elements and take sum
of it.
colSums(sapply(df, function(x) duplicated(x)|duplicated(x, fromLast = TRUE)))
#A B C
#2 0 2
If you need only non-zero column you could do
vals <- colSums(sapply(df, function(x)
duplicated(x)|duplicated(x, fromLast = TRUE)))
vals[vals!= 0]
#A C
#2 2
Using dplyr
we can do the same with summarise_all
library(dplyr)
df %>%
summarise_all(~sum(duplicated(.) | duplicated(., fromLast = TRUE))) %>%
select_if(~. != 0)
# A C
#1 2 2
Upvotes: 3