Reputation: 751

Remove semi duplicate rows in R

I have the following data.frame.

a <- c(rep("A", 3), rep("B", 3), rep("C",2), "D")
b <- c(NA,1,2,4,1,NA,2,NA,NA)
c <- c(1,1,2,4,1,1,2,2,2)
d <- c(1,2,3,4,5,6,7,8,9)
df <-data.frame(a,b,c,d)


  a  b c d
1 A NA 1 1
2 A  1 1 2
3 A  2 2 3
4 B  4 4 4
5 B  1 1 5
6 B NA 1 6
7 C  2 2 7
8 C NA 2 8
9 D NA 2 9

I want to remove duplicate rows (based on column A & C) so that the row with values in column B are kept. In this example, rows 1, 6, and 8 are removed.

Upvotes: 0

Answers (3)

Ranjan Raj

Reputation: 11

First create two datasets, one with duplicates in column a and one without duplicate in column a using the below function :

x = df[df$a %in% names(which(table(df$a) > 1)), ]
x1 = df[df$a %in% names(which(table(df$a) ==1)), ]

Now use na.omit function on data set x to delete the rows with NA and then rbind x and x1 to the final data set.

rbind(na.omit(x),x1)

Answer:

   a  b c d

2  A  1 1 2

3  A  2 2 3

4  B  4 4 4

5  B  1 1 5

7  C  2 2 7

9  D NA 2 9

Upvotes: 1

discipulus

Reputation: 2725

You can use dplyr to do this.

df %>% distinct(a, c,  .keep_all = TRUE)

Output

There are other options in dplyr, check this question for details: Remove duplicated rows using dplyr

Upvotes: 0

akrun

Reputation: 887971

One way to do this is to order by 'a', 'b' and the the logical vector based on 'b' so that all 'NA' elements will be last for each group of 'a', and 'b'. Then, apply the duplicated and keep only the non-duplicate elements

df1 <- df[order(df$a, df$b, is.na(df$b)),]
df2 <- df1[!duplicated(df1[c('a', 'c')]),]
df2
#  a  b c d
#2 A  1 1 2
#3 A  2 2 3
#5 B  1 1 5
#4 B  4 4 4
#7 C  2 2 7
#9 D NA 2 9

setdiff(seq_len(nrow(df)), row.names(df2) )
#[1] 1 6 8

Upvotes: 1

Remove semi duplicate rows in R

Answers (3)

Output

Related Questions