BenHealey

Reputation: 348

Identifying duplicate columns in a dataframe

I'm an R newbie and am attempting to remove duplicate columns from a largish dataframe (50K rows, 215 columns). The frame has a mix of discrete, continuous, and categorical variables.

My approach has been to build a list containing a table() of each column in the frame, then use the duplicated() function to find entries in that list that are duplicates, as follows:

age=18:29
height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
gender=c("M","F","M","M","F","F","M","M","F","M","F","M")
testframe = data.frame(age=age,height=height,height2=height,gender=gender,gender2=gender)

tables = apply(testframe, 2, table)               # one frequency table per column
dups = which(duplicated(tables))                  # positions of duplicated tables
testframe <- subset(testframe, select = -c(dups)) # drop those columns

This isn't very efficient, especially for large continuous variables. However, I've gone down this route because I've been unable to get the same result using summary (note, the following assumes an original testframe containing duplicates):

summaries=apply(testframe,2,summary)
dups=which(duplicated(summaries))
testframe <- subset(testframe, select = -c(dups))

If you run that code you'll see it only removes the first duplicate found. I presume this is because I am doing something wrong. Can anyone point out where I am going wrong or, even better, point me in the direction of a better way to remove duplicate columns from a dataframe?

Upvotes: 29

Views: 43089

Answers (10)

Emmanuel-Lin

Reputation: 1943

Since this Q&A is a popular Google search result, but the existing answers are a bit slow for a large matrix, I propose a new version using exponential search and the power of data.table.

This is a function I implemented in the dataPreparation package.

The function

dataPreparation::which_are_in_double

which_are_in_double(testframe)

which returns 3 and 5, the columns that are duplicated in your example.
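
To actually drop the flagged columns (a minimal sketch; it assumes, as above, that which_are_in_double() returns the indexes of the later duplicate columns):

library(dataPreparation)

dup_cols <- which_are_in_double(testframe, verbose = FALSE)  # e.g. c(3, 5) here
if (length(dup_cols) > 0) {
  testframe <- testframe[, -dup_cols]  # drop duplicates, keep first occurrences
}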

Build a data set with the wanted dimensions for the performance tests:

age=18:29
height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
gender=c("M","F","M","M","F","F","M","M","F","M","F","M")
testframe = data.frame(age=age,height=height,height2=height,gender=gender,gender2=gender)

for (i in 1:12){
  testframe = rbind(testframe,testframe)  
}
# Result in 49152 rows
for (i in 1:5){
  testframe = cbind(testframe,testframe)  
}
# Result in 160 columns

The benchmark

To perform the benchmark, I use the rbenchmark library, which repeats each computation 100 times:

library(rbenchmark)
library(dataPreparation)
library(digest)

benchmark(
  which_are_in_double(testframe, verbose=FALSE),
  duplicated(lapply(testframe, summary)),
  duplicated(lapply(testframe, digest))
)
                                             test replications elapsed
3           duplicated(lapply(testframe, digest))          100  39.505
2          duplicated(lapply(testframe, summary))          100  20.412
1 which_are_in_double(testframe, verbose = FALSE)          100  13.581

So which_are_in_double is 1.5 to 3 times faster than the other proposed solutions.

NB 1: I excluded from the benchmark the solution testframe[,colnames(unique(as.matrix(testframe), MARGIN=2))] because it was already 10 times slower with 12k rows.

NB 2: Please note that, because of the way this data set is constructed, we have a lot of duplicated columns, which reduces the advantage of the exponential search. With just a few duplicated columns, which_are_in_double would perform much better, while the other methods would perform about the same.

Upvotes: 0

Vitali Avagyan

Reputation: 1203

What about just:

unique.matrix(testframe, MARGIN=2) 
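
Note that testframe is a data frame rather than a matrix, so a safer variant is to coerce explicitly and convert back (a minimal sketch; the coercion turns mixed-type columns into character, as in Henry's answer below):

as.data.frame(unique(as.matrix(testframe), MARGIN = 2))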

Upvotes: 2

M Boulanger

Reputation: 1

If the problem is that dataframes have been merged one time too many using, for example:

    testframe2 <- merge(testframe, testframe, by = c('age'))

It is also good to remove the .x suffix from the column names. I applied it here on top of Mostafa Rezaei's great answer:

    testframe2 <- testframe2[!duplicated(as.list(testframe2))]
    names(testframe2) <- gsub('\\.x$', '', names(testframe2))  # anchored pattern so only the suffix is removed

Upvotes: 0

Matt Elgazar

Reputation: 725

It is probably best for you to first find the duplicate column names and treat them accordingly (for example, summing the two, taking the mean, first, last, second, mode, etc.). To find the duplicate columns:

names(df)[duplicated(names(df))]
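
For instance, a minimal sketch of the "taking the mean" option (df is a hypothetical frame with duplicate-named columns; it assumes those columns are numeric):

dup_names <- unique(names(df)[duplicated(names(df))])
for (nm in dup_names) {
  idx <- which(names(df) == nm)
  df[[idx[1]]] <- rowMeans(df[idx])  # replace the first occurrence with the mean
  df <- df[-idx[-1]]                 # drop the remaining occurrences
}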

Upvotes: 2

Fabio Natalini

Reputation: 197

Here is a simple command that would work if the duplicated columns of your data frame had the same names:

testframe[names(testframe)[!duplicated(names(testframe))]]

Upvotes: 0

Holger Brandl

Reputation: 11192

Actually, you would just need to invert the duplicated() result in your code, and you can stick with subset() (which is more readable than bracket notation, imho):

library(dplyr)
iris %>% subset(., select = which(!duplicated(names(.))))

Upvotes: 0

hshihab

Reputation: 416

A nice trick that you can use is to transpose your data frame and then check for duplicates.

duplicated(t(testframe))
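
To actually drop the duplicates rather than just flag them (a small sketch; note that t() coerces the mixed-type frame to a character matrix, which is fine for exact-duplicate detection):

testframe[, !duplicated(t(testframe))]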

Upvotes: 11

Mostafa Rezaei

Reputation: 629

How about:

testframe[!duplicated(as.list(testframe))]

Upvotes: 36

Henry

Reputation: 6784

unique(testframe, MARGIN=2) 

does not work, though I think it should, so try

as.data.frame(unique(as.matrix(testframe), MARGIN=2))

or if you are worried about numbers turning into factors,

testframe[,colnames(unique(as.matrix(testframe), MARGIN=2))]

which produces

   age height gender
1   18   76.1      M
2   19   77.0      F
3   20   78.1      M
4   21   78.2      M
5   22   78.8      F
6   23   79.7      F
7   24   79.9      M
8   25   81.1      M
9   26   81.2      F
10  27   81.8      M
11  28   82.8      F
12  29   83.5      M

Upvotes: 3

kohske

Reputation: 66842

You can do it with lapply:

testframe[!duplicated(lapply(testframe, summary))]

summary summarizes the distribution while ignoring the order.

Not 100% sure, but I would use digest() if the data is huge:

library(digest)
testframe[!duplicated(lapply(testframe, digest))]
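
A small illustration of why the summary() version is lossy while digest() is not (a sketch with made-up values; summary() rounds to roughly four significant digits by default, so near-identical columns can collide):

library(digest)

# two columns that differ only in the fifth significant digit
df <- data.frame(x = c(1.00001, 2), y = c(1.00002, 2))

duplicated(lapply(df, summary))  # FALSE TRUE:  y is wrongly flagged as a duplicate
duplicated(lapply(df, digest))   # FALSE FALSE: the hashes see the full values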

Upvotes: 24
