Grouping rows based on duplicated values

Question

I'm having a blind spot with this particular problem for some reason.

Sample data:

         date id1 id2
   1888-09-08   A   E
   1888-09-15   B   M
   1888-09-22   C   L
   1888-09-22   D   B
   1888-09-29   A   R
   1888-10-27   F   G
   1888-11-17   J   H
   1888-11-17   E   O
   1888-11-24   H   F
   1888-12-15   G   L
   1889-01-19   K   G
   1889-02-09   M   E

What I am trying to do is create a 4th variable that will start at 1 and increase by 1 at the observation whenever a value from either 'id1' or 'id2' is repeated. It will increase by 1 again whenever a value is duplicated starting from the observation when it last increased.

Expected results:

         date id1 id2 value
   1888-09-08   A   E     1
   1888-09-15   B   M     1
   1888-09-22   C   L     1
   1888-09-22   D   B     2
   1888-09-29   A   R     2
   1888-10-27   F   G     2
   1888-11-17   J   H     2
   1888-11-17   E   O     2
   1888-11-24   H   F     3
   1888-12-15   G   L     3
   1889-01-19   K   G     4
   1889-02-09   M   E     4

As you can see, on the fourth row, the id "B" repeated as it had been seen before. Therefore the value variable increased to '2'. Starting with this row, we then search for the next duplicated 'id'. That happens on row 9, where a "H" repeats (it had been seen in a previous observation since the last 'value' change). The value then changes to 3, and next changes on the next row but one as a 'G' repeats.

I hope this makes sense.

I haven't got very far. My main idea consists of turning the two columns into a long vector and then find the first duplicated value. From this, we could work out that all elements that belonged to different rows would get a '1' whereas this element '8' and it's partner '7' would get a 2. We could go from the 7th element (the first in the pair) forward looking for more duplicates...

temp <- as.matrix(c(rbind(as.character(dfr[,2]), as.character(dfr[,3])))) #put two columns into single vector

min(which(duplicated(temp))) #8

I'm not sure how effective this will be.

Jason · Accepted Answer

This solution isn't going to be fast but it works.

x<-sample(LETTERS[1:5],40,replace=TRUE)
y<-sample(LETTERS[1:5],40,replace=TRUE)
df<-data.frame(x=x,y=y,value=1)
df$x<-as.character(df$x)
df$y<-as.character(df$y)

vec<-c(df[1,c('x','y')])
for(i in 2:nrow(df)){

    if(df[i,'x'] %in% vec | df[i,'y'] %in% vec) {  #checking whether in vector

        print('in') #printing when duplicates are found
        df[i,'value']=df[i-1,'value']+1 #adding to the value
        vec<-c(df[i,c('x','y')]) #resetting the vector

    } else {

        df[i,'value']=df[i-1,'value'] 
        vec<-c(vec,df[i,c('x','y')])   #if not found i add to the vector 

    }

}

Grouping rows based on duplicated values

Answers (1)

Related Questions