Leah Liu
Leah Liu

Reputation: 31

R select one row from duplicated rows after compare multi conditions

I got these duplicated records from ton of data. Now, I need to choose one row from these duplicated rows.

ID <- c("6820","6820","17413","17413","38553","38553","52760","52760","717841","717841","717841","747187","747187","747187")
date <- c("2014-06-12","2015-06-11","2014-05-01","2014-05-01","2014-06-12","2015-06-11","2014-10-24","2014-10-24","2014-05-01","2014-05-01","2014-12-02","2014-03-01","2014-05-12","2014-05-12")
type <- c("ST","ST","MC","MC","LC","LC","YA","YA","YA","YA","MC","LC","LC","MC")
level <-c("firsttime","new","new","active","active","active","firsttime","new","active","new","active","new","active","active")
data <- data.frame(ID,date,type,level)

The data frame will look like this: enter image description here

I want to compare this: for each ID,if their dates are different, then keep all of them in df.right; if the date is same, then compare type, choose them in order of LC>MC>YA>ST (eg. choose MC over YA), put them into df.right; if type is same, then compare level, choose them in order of active>new>firsttime (eg. choose new over first time), and put the choosen into df.right.

I tried to use foreach, this is only on the first step, and it is not working for ID have 3 duplicated rows.

foreach (i=unique(data$ID), .combine='rbind') %do% {data[data$ID==i, "date"][1] == data[data$ID==i, "date"][2])
b <- data[data$ID==i,]}

The result should be like this: enter image description here Does anybody knows how to do this? really appreciate it. Thank you

Upvotes: 3

Views: 390

Answers (2)

pcantalupo
pcantalupo

Reputation: 2226

The trick here is to order the levels of type and level as appropriate. Then two deduplications are necessary: first, to remove duplicate rows based on the columns ID, date, type and second, remove dup rows based upon first two columns:

type = factor(type, levels=c("ST","YA","MC","LC"))
level = factor(level, levels=c("active","new","firsttime"))
data <- data.frame(ID,date,type,level)

d = with(data, data[order(ID, date, type, level),])
e = d[-which(duplicated(d[,1:3])),]
df.right = e[-which(duplicated(e[,1:2])),]
df.right = df.right[order(as.numeric(as.character(df.right$ID)), df.right$date),]
df.right

Output:

       ID       date type     level
1    6820 2014-06-12   ST firsttime
2    6820 2015-06-11   ST       new
4   17413 2014-05-01   MC    active
5   38553 2014-06-12   LC    active
6   38553 2015-06-11   LC    active
8   52760 2014-10-24   YA       new
9  717841 2014-05-01   YA    active
11 717841 2014-12-02   MC    active
12 747187 2014-03-01   LC       new
14 747187 2014-05-12   MC    active

Upvotes: 1

C_Z_
C_Z_

Reputation: 7796

The dplyr package is good for this sort of thing

Using factors, you can specify how you want your categories ordered. Then you can pick the first of each type and level for each unique ID/date pair.

library(dplyr)

ID <- c("6820","6820","17413","17413","38553","38553","52760","52760","717841","717841","717841","747187","747187","747187")
date <- c("2014-06-12","2015-06-11","2014-05-01","2014-05-01","2014-06-12","2015-06-11","2014-10-24","2014-10-24","2014-05-01","2014-05-01","2014-12-02","2014-03-01","2014-05-12","2014-05-12")
type <- c("ST","ST","MC","MC","LC","LC","YA","YA","YA","YA","MC","LC","LC","MC")
level <-c("firsttime","new","new","active","active","active","firsttime","new","active","new","active","new","active","active")

type <- factor(type, levels=c("LC", "MC", "YA", "ST"))

level <- factor(level, levels=c("active", "new", "firsttime"))

data <- data.frame(ID,date,type,level)

df.right <- data %>%
  group_by(ID, date) %>%
  filter(type == sort(type)[1]) %>%
  filter(level == sort(level)[1])

Upvotes: 2

Related Questions