ayush

Reputation: 343

Removing duplicate rows on the basis of specific columns

How can I remove duplicate rows based on specific columns while keeping the rest of the dataset intact? I tried the approaches from link1 and link2.

I want to check for duplicates based on columns 3 to 6. If two rows have the same values in those columns, the processed dataset should keep only one of them, as shown in the example below.

I used this code, but it gave me only part of the result:

Data <- unique(Data[, 3:6])

Let's suppose my dataset looks like this:

 A  B  C  D  E  F  G  H  I  J  K  L  M
 1  2  2  1  5  4 12  A  3  5  6  2  1
 1  2  2  1  5  4 12  A  2 35 36 22 21
 1 22 32 31  5 34 12  A  3  5  6  2  1

What I want in my output is:

 A  B  C  D  E  F  G  H  I  J  K  L  M
 1  2  2  1  5  4 12  A  3  5  6  2  1
 1 22 32 31  5 34 12  A  3  5  6  2  1

Upvotes: 1

Views: 223

Answers (2)

RHertel

Reputation: 23788

Assuming that your data is stored as a dataframe, you could try:

Data <- Data[!duplicated(Data[, 3:6]), ]
#> Data
#  A  B  C  D E  F  G H I J K L M
#1 1  2  2  1 5  4 12 A 3 5 6 2 1
#3 1 22 32 31 5 34 12 A 3 5 6 2 1

The function duplicated() returns a logical vector that indicates, for each row, whether the combination of entries in columns 3 to 6 has already appeared in an earlier row. Negating this vector with ! selects the first occurrence of each combination, so the resulting dataset contains only unique combinations of the entries in columns 3 to 6.
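
To make the intermediate step visible, here is a minimal sketch that rebuilds the three example rows from the question and prints the logical vector before it is used for subsetting. The name Data follows the question; the helper names dup and result are only for illustration.

```r
# Rebuild the three example rows from the question
Data <- data.frame(
  A = c(1, 1, 1),  B = c(2, 2, 22),  C = c(2, 2, 32),  D = c(1, 1, 31),
  E = c(5, 5, 5),  F = c(4, 4, 34),  G = c(12, 12, 12), H = c("A", "A", "A"),
  I = c(3, 2, 3),  J = c(5, 35, 5),  K = c(6, 36, 6),  L = c(2, 22, 2),
  M = c(1, 21, 1),
  stringsAsFactors = FALSE
)

# TRUE marks a row whose columns 3 to 6 repeat an earlier row
dup <- duplicated(Data[, 3:6])
dup
# [1] FALSE  TRUE FALSE

# Keep only the rows that are not flagged
result <- Data[!dup, ]
rownames(result)
# [1] "1" "3"
```

If the last occurrence of each combination should be kept instead of the first, duplicated() also accepts fromLast = TRUE.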

Thanks to @thelatemail for pointing out a mistake in my previous post.

Upvotes: 2

akrun

Reputation: 886938

Another option is the unique method from data.table, which has a by argument. We convert the 'data.frame' to a 'data.table' with setDT(df1), call unique, and specify the columns in by:

 library(data.table)
 unique(setDT(df1), by= names(df1)[3:6])
 #   A  B  C  D E  F  G H I J K L M
 #1: 1  2  2  1 5  4 12 A 3 5 6 2 1
 #2: 1 22 32 31 5 34 12 A 3 5 6 2 1

unique returns a data.table with duplicated rows removed.
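
For a self-contained run, here is a small sketch assuming the data.table package is installed. The name df1 follows the code above; to keep it short it holds only columns A to F of the question's example (my simplification), and res is an illustrative name.

```r
library(data.table)

# Reduced version of the question's example: columns A to F only
df1 <- data.frame(
  A = c(1, 1, 1),  B = c(2, 2, 22), C = c(2, 2, 32),
  D = c(1, 1, 31), E = c(5, 5, 5),  F = c(4, 4, 34)
)

# setDT() converts df1 to a data.table by reference (no copy is made)
setDT(df1)

# Keep the first row of each distinct combination of columns 3 to 6
res <- unique(df1, by = names(df1)[3:6])
nrow(res)
# [1] 2
res$B
# [1]  2 22
```

Because setDT() modifies df1 in place, df1 is a data.table afterwards; as.data.table(df1) would work on a copy instead.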

Upvotes: 2
