colin

Reputation: 2666

Check if any columns within a data frame are identical in R

I am iteratively fitting models to many different variables, and in a few rare cases two columns I am using as independent variables contain an identical set of values. This makes the model unidentifiable and throws an error. I would like a way to check if any columns are identical to any other columns within a dataframe, and then return the names of the columns that have a problem. Here is an example dataframe.

a <- rnorm(10)
b <- rnorm(10)
c <- a
d <- rnorm(10)
dat <- data.frame(a,b,c,d)

Folks have answered how to test if two individual columns in a dataframe are identical here. However, I would like a way to check every column against every other column.

Upvotes: 2

Views: 1166

Answers (3)

IceCreamToucan

Reputation: 28675

You can use combn to get all pairs of column numbers, then apply over the resulting matrix to check if all elements are equal.

pairs <- t(combn(seq_len(ncol(dat)), 2))

same <- apply(pairs, 1, function(x) all(Reduce(`==`, dat[,x])))

pairs[same,]
# [1] 1 3
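
Since the question asks for column names rather than indices, the same pairwise idea can be sketched as a small helper. This is only a sketch and not part of the answer above: `identical_col_pairs` is a hypothetical name, and it swaps `identical()` in for `==` so that columns containing NA or non-numeric values still give a plain TRUE/FALSE. It assumes the data frame has at least two columns.

```r
# Hypothetical helper: returns a two-column character matrix with the names
# of every pair of columns whose values are exactly identical.
identical_col_pairs <- function(df) {
  # Each column of `pairs` holds one pair of column names
  pairs <- combn(names(df), 2)
  # identical() yields a single TRUE/FALSE, even with NA or factor columns
  same <- apply(pairs, 2, function(p) identical(df[[p[1]]], df[[p[2]]]))
  # Transpose so that each row is one matching pair
  t(pairs[, same, drop = FALSE])
}

# Same layout as the example data in the question
set.seed(1)
a <- rnorm(10); b <- rnorm(10); c <- a; d <- rnorm(10)
dat <- data.frame(a, b, c, d)

identical_col_pairs(dat)  # one row naming the pair "a" and "c"
```

If no columns match, the result is a matrix with zero rows, so `nrow()` of the result can be used as a quick check before fitting.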

Or check the correlations. Note that this also flags any column that is a positive linear transformation of another (e.g. 2 * a + 1), not just exact copies.

cor1 <- data.frame(which(cor(dat) == 1, arr.ind = TRUE))
cor1[cor1$row > cor1$col,]
#   row col
# c   3   1

Upvotes: 6

Lamia

Reputation: 3875

You could use the dist function to compute the matrix of distances between your columns, and find the combinations of columns for which the distance is 0.

m <- as.matrix(dist(t(dat)))
m[upper.tri(m, diag = TRUE)] <- NA
which(m < 1.5e-8, arr.ind = TRUE)
#   row col
# c   3   1

Note that this solution will only work for numerical columns. If you have qualitative variables in your dataframe, you won't be able to compare them.
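
For data frames that mix numeric and qualitative columns, one alternative (not from the answer above) is to treat the data frame as a list of columns and let base R's `duplicated()` flag any column that repeats an earlier one. A minimal sketch, assuming exact equality is what you want (no floating-point tolerance); `dat_mixed` and its columns are made up for illustration:

```r
# Illustrative data: z duplicates the numeric column x,
# g2 duplicates the character column g1
dat_mixed <- data.frame(
  x  = c(1.5, 2.5, 3.5),
  g1 = c("lo", "hi", "lo"),
  z  = c(1.5, 2.5, 3.5),
  g2 = c("lo", "hi", "lo"),
  stringsAsFactors = FALSE
)

# duplicated() on a list marks each element that equals an earlier element
dup <- duplicated(as.list(dat_mixed))

# Names of the columns that repeat an earlier column
names(dat_mixed)[dup]
```

This reports only the later copy of each duplicated column, which is usually the one you would drop before fitting.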

Upvotes: 1

markus

Reputation: 26343

The caret package contains the function findLinearCombos, which you might want to try:

caret::findLinearCombos(dat)
#$linearCombos
#$linearCombos[[1]]
#[1] 3 1


#$remove
#[1] 3

But be aware that the function would also recommend deleting a column that is a multiplied by -1, as the second example shows.

Second example

dat2 <- data.frame(a,b,c,d, e = -a) 
caret::findLinearCombos(dat2)
#$linearCombos
#$linearCombos[[1]]
#[1] 3 1

#$linearCombos[[2]]
#[1] 5 1


#$remove
#[1] 3 5

Upvotes: 6
