Jay
Jay

Reputation: 751

Remove semi duplicate rows in R

I have the following data.frame.

a <- c(rep("A", 3), rep("B", 3), rep("C",2), "D")
b <- c(NA,1,2,4,1,NA,2,NA,NA)
c <- c(1,1,2,4,1,1,2,2,2)
d <- c(1,2,3,4,5,6,7,8,9)
df <-data.frame(a,b,c,d)


  a  b c d
1 A NA 1 1
2 A  1 1 2
3 A  2 2 3
4 B  4 4 4
5 B  1 1 5
6 B NA 1 6
7 C  2 2 7
8 C NA 2 8
9 D NA 2 9

I want to remove duplicate rows (based on column A & C) so that the row with values in column B are kept. In this example, rows 1, 6, and 8 are removed.

Upvotes: 0

Views: 223

Answers (3)

Ranjan Raj
Ranjan Raj

Reputation: 11

First create two datasets, one with duplicates in column a and one without duplicate in column a using the below function :

x = df[df$a %in% names(which(table(df$a) > 1)), ]
x1 = df[df$a %in% names(which(table(df$a) ==1)), ]

Now use na.omit function on data set x to delete the rows with NA and then rbind x and x1 to the final data set.

rbind(na.omit(x),x1)

Answer:

   a  b c d

2  A  1 1 2

3  A  2 2 3

4  B  4 4 4

5  B  1 1 5

7  C  2 2 7

9  D NA 2 9

Upvotes: 1

discipulus
discipulus

Reputation: 2725

You can use dplyr to do this.

df %>% distinct(a, c,  .keep_all = TRUE)                 

Output

  a  b c d
1 A NA 1 1
2 A  2 2 3
3 B  4 4 4
4 B  1 1 5
5 C  2 2 7
6 D NA 2 9

There are other options in dplyr, check this question for details: Remove duplicated rows using dplyr

Upvotes: 0

akrun
akrun

Reputation: 887971

One way to do this is to order by 'a', 'b' and the the logical vector based on 'b' so that all 'NA' elements will be last for each group of 'a', and 'b'. Then, apply the duplicated and keep only the non-duplicate elements

df1 <- df[order(df$a, df$b, is.na(df$b)),]
df2 <- df1[!duplicated(df1[c('a', 'c')]),]
df2
#  a  b c d
#2 A  1 1 2
#3 A  2 2 3
#5 B  1 1 5
#4 B  4 4 4
#7 C  2 2 7
#9 D NA 2 9

setdiff(seq_len(nrow(df)), row.names(df2) )
#[1] 1 6 8

Upvotes: 1

Related Questions