BIN
BIN

Reputation: 781

Check duplicated for all of row by group in R

I have data set, I want to create a variable check to check there is any row of variable day in group IDdifferent.

df <- data.frame(ID=c("id1", "id1","id2", "id2","id3","id3","id3"),
             day=c("01/02/2008","01/02/2008","10/02/2009","08/03/2009","11/08/2007","11/08/2007","11/08/2008"),
             it =c("ul","tr","cb","ul","ul","tc","tr"))
df$day <- as.Date(as.character(df$day), format = "%m/%d/%Y")
  ID        day it
1 id1 2008-01-02 ul
2 id1 2008-01-02 tr
3 id2 2009-10-02 cb
4 id2 2009-08-03 ul
5 id3 2007-11-08 ul
6 id3 2007-11-08 tc
7 id3 2008-11-08 tr

The problem when I use this code, id3 listed in duplicated, because it have 2 rows the same, but I want all of row need to be the same, if not I don't count.

c <- df[duplicated(df$ID) & duplicated(df$day),]
df1 <- df[df$ID %in% c$ID,]
   ID        day it
1 id1 2008-01-02 ul
2 id1 2008-01-02 tr
5 id3 2007-11-08 ul
6 id3 2007-11-08 tc
7 id3 2008-11-08 tr

I want the output like this

   ID        day it check
1 id1 2008-01-02 ul   Yes
2 id1 2008-01-02 tr   Yes
3 id2 2009-10-02 cb    No
4 id2 2009-08-03 ul    No
5 id3 2007-11-08 ul    No
6 id3 2007-11-08 tc    No
7 id3 2008-11-08 tr    No

Upvotes: 3

Views: 2296

Answers (3)

acylam
acylam

Reputation: 18681

Here's a solution using dplyr:

library(dplyr)

dupsID = df %>%
  group_by(ID, day) %>%
  distinct(.keep_all = TRUE) %>%
  group_by(ID) %>%
  filter(n() == 1)

First group by ID then by day, extract the distinct elements. Note that "id3 2007-11-08" and "id3 2008-11-08" would be counted as distinct since I grouped by both ID and day. .keep_all = TRUE means that I want to keep all the columns. The second group_by(ID) groups the result from distinct only by ID. I then extracted samples with only one row for each group. This results in:

# > dupsID
# Source: local data frame [1 x 3]
# Groups: ID [1]
# 
#          ID        day     it
#     <fctr>     <date> <fctr>
#   1    id1 2008-01-02     ul

Finally, I check which ID in df are in ID column of dupsID

df$check = df$ID %in% dupsID$ID

# > df
#    ID        day it check
# 1 id1 2008-01-02 ul  TRUE
# 2 id1 2008-01-02 tr  TRUE
# 3 id2 2009-10-02 cb FALSE
# 4 id2 2009-08-03 ul FALSE
# 5 id3 2007-11-08 ul FALSE
# 6 id3 2007-11-08 tc FALSE
# 7 id3 2008-11-08 tr FALSE

Upvotes: 3

d.b
d.b

Reputation: 32548

First, split df by ID. Then, for each subgroup, check if ID and day are same (performed by pasting them together in this example). This will give a logical for each ID. Then you can store that in a variable (temp in this example) and lookup from that.

temp = sapply(split(df, df$ID), function(a)
      length(unique(paste(a$ID, a$day))) == 1 & nrow(a) > 1)
temp[match(df$ID, names(temp))]
#id1   id1   id2   id2   id3   id3   id3 
# TRUE  TRUE FALSE FALSE FALSE FALSE FALSE 

Upvotes: 2

Sathish
Sathish

Reputation: 12713

within(df,  # attach the columns of df in a separate environment using within()
       check <- unlist( by(df,   # group df by ID using by()
                           INDICES = ID, # check for unique of days with length = 1, if so return true else false
                           FUN = function(x) rep( length( unique( x$day ) ) == 1, length(x$day) ) ) 
                        ))
#    ID        day it check
# 1 id1 01/02/2008 ul  TRUE
# 2 id1 01/02/2008 tr  TRUE
# 3 id2 10/02/2009 cb FALSE
# 4 id2 08/03/2009 ul FALSE
# 5 id3 11/08/2007 ul FALSE
# 6 id3 11/08/2007 tc FALSE
# 7 id3 11/08/2008 tr FALSE

Data:

df <- data.frame(ID=c("id1", "id1","id2", "id2","id3","id3","id3"),
                 day=c("01/02/2008","01/02/2008","10/02/2009","08/03/2009","11/08/2007","11/08/2007","11/08/2008"),
                 it =c("ul","tr","cb","ul","ul","tc","tr"))

Upvotes: 2

Related Questions