Reputation: 31
I got these duplicated records from ton of data. Now, I need to choose one row from these duplicated rows.
ID <- c("6820","6820","17413","17413","38553","38553","52760","52760","717841","717841","717841","747187","747187","747187")
date <- c("2014-06-12","2015-06-11","2014-05-01","2014-05-01","2014-06-12","2015-06-11","2014-10-24","2014-10-24","2014-05-01","2014-05-01","2014-12-02","2014-03-01","2014-05-12","2014-05-12")
type <- c("ST","ST","MC","MC","LC","LC","YA","YA","YA","YA","MC","LC","LC","MC")
level <-c("firsttime","new","new","active","active","active","firsttime","new","active","new","active","new","active","active")
data <- data.frame(ID,date,type,level)
The data frame will look like this:
I want to compare this: for each ID,if their dates are different, then keep all of them in df.right; if the date is same, then compare type, choose them in order of LC>MC>YA>ST (eg. choose MC over YA), put them into df.right; if type is same, then compare level, choose them in order of active>new>firsttime (eg. choose new over first time), and put the choosen into df.right.
I tried to use foreach, this is only on the first step, and it is not working for ID have 3 duplicated rows.
foreach (i=unique(data$ID), .combine='rbind') %do% {data[data$ID==i, "date"][1] == data[data$ID==i, "date"][2])
b <- data[data$ID==i,]}
The result should be like this: Does anybody knows how to do this? really appreciate it. Thank you
Upvotes: 3
Views: 390
Reputation: 2226
The trick here is to order the levels of type
and level
as appropriate. Then two deduplications are necessary: first, to remove duplicate rows based on the columns ID, date, type
and second, remove dup rows based upon first two columns:
type = factor(type, levels=c("ST","YA","MC","LC"))
level = factor(level, levels=c("active","new","firsttime"))
data <- data.frame(ID,date,type,level)
d = with(data, data[order(ID, date, type, level),])
e = d[-which(duplicated(d[,1:3])),]
df.right = e[-which(duplicated(e[,1:2])),]
df.right = df.right[order(as.numeric(as.character(df.right$ID)), df.right$date),]
df.right
Output:
ID date type level
1 6820 2014-06-12 ST firsttime
2 6820 2015-06-11 ST new
4 17413 2014-05-01 MC active
5 38553 2014-06-12 LC active
6 38553 2015-06-11 LC active
8 52760 2014-10-24 YA new
9 717841 2014-05-01 YA active
11 717841 2014-12-02 MC active
12 747187 2014-03-01 LC new
14 747187 2014-05-12 MC active
Upvotes: 1
Reputation: 7796
The dplyr
package is good for this sort of thing
Using factors, you can specify how you want your categories ordered. Then you can pick the first of each type and level for each unique ID/date pair.
library(dplyr)
ID <- c("6820","6820","17413","17413","38553","38553","52760","52760","717841","717841","717841","747187","747187","747187")
date <- c("2014-06-12","2015-06-11","2014-05-01","2014-05-01","2014-06-12","2015-06-11","2014-10-24","2014-10-24","2014-05-01","2014-05-01","2014-12-02","2014-03-01","2014-05-12","2014-05-12")
type <- c("ST","ST","MC","MC","LC","LC","YA","YA","YA","YA","MC","LC","LC","MC")
level <-c("firsttime","new","new","active","active","active","firsttime","new","active","new","active","new","active","active")
type <- factor(type, levels=c("LC", "MC", "YA", "ST"))
level <- factor(level, levels=c("active", "new", "firsttime"))
data <- data.frame(ID,date,type,level)
df.right <- data %>%
group_by(ID, date) %>%
filter(type == sort(type)[1]) %>%
filter(level == sort(level)[1])
Upvotes: 2