upendra

Reputation: 2189

Compare different IDs in a column in R

I have a column in a data frame that contains IDs, some of which are duplicated. Starting from the first ID, I want to check whether the same ID appears in the next row of the column. If it does, I do something; if not, I move on to the next ID and repeat. Here is the column in the data frame:

     V4
Contig1401|m.3412
Contig1428|m.3512
Contig1755|m.4465
Contig1755|m.4465
Contig1897|m.4878
Contig1897|m.4878
Contig1757|m.4476
Contig1598|m.4011
Contig1759|m.4481
Contig1685|m.4244

As you can see, some IDs are duplicated and some are not. How do I go about this? So far I have written this:

first_id <- "Contig1401|m.3412"

for (i in data$V4) {
  if (i == first_id) {   # comparison needs ==, not =
    # do something
  } else {
    # do something else
  }
}

But I don't understand how to proceed from here. Basically, I want to obtain this:

       V4          V5
Contig1401|m.3412  1
Contig1428|m.3512  1
Contig1755|m.4465  2
Contig1755|m.4465  
Contig1897|m.4878  2
Contig1897|m.4878
Contig1757|m.4476  1
Contig1598|m.4011  1
Contig1759|m.4481  1
Contig1685|m.4244  1

Any ideas on how I can do this?

Thanks, Upendra

Upvotes: 1

Views: 82

Answers (2)

Jota

Reputation: 17611

Here is an idea:

dat <- read.table(text="V4
Contig1401|m.3412
Contig1428|m.3512
Contig1755|m.4465
Contig1755|m.4465
Contig1897|m.4878
Contig1897|m.4878
Contig1757|m.4476
Contig1598|m.4011
Contig1759|m.4481",header=T)

# Find entries where the next entry is the same
# Convert TRUE values to be 2 and FALSE values to be 1 by adding +1
same.as.next <- sapply(1:length(dat[,1]),
                function(x)identical(dat[x,],dat[(x+1),]))+1

dat <- data.frame(dat,V5 = same.as.next)

dat[duplicated(dat$V4),]$V5 <- NA
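
For what it's worth, the same comparison can probably be done without `sapply` by comparing the column against a copy of itself shifted up by one row. This is only a sketch, assuming the IDs are held in a plain character vector; `v4`, `next_v4`, `same_as_next` and `v5` are names made up for this illustration.

```r
# Hypothetical stand-in for the V4 column used above
v4 <- c("Contig1401|m.3412", "Contig1428|m.3512",
        "Contig1755|m.4465", "Contig1755|m.4465",
        "Contig1897|m.4878", "Contig1897|m.4878",
        "Contig1757|m.4476")

# Shift the vector up by one position; the last element has no successor
next_v4 <- c(v4[-1], NA)

# TRUE where an entry equals the one after it; the last row counts as FALSE
same_as_next <- !is.na(next_v4) & v4 == next_v4

v5 <- same_as_next + 1L     # TRUE -> 2, FALSE -> 1
v5[duplicated(v4)] <- NA    # blank out repeated occurrences
v5
# 1 1 2 NA 2 NA 1
```

If the column is a factor, wrapping it in `as.character()` first should keep the comparison well defined.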

Edit to address the OP's comment regarding entries that show up more than twice:

# notice Contig1755|m.4465 shows up 5 times in this example
dat.multduplicates <- read.table(text="V4
Contig1401|m.3412
Contig1428|m.3512
Contig1755|m.4465
Contig1755|m.4465
Contig1755|m.4465
Contig1897|m.4878
Contig1897|m.4878
Contig1757|m.4476
Contig1598|m.4011
Contig1759|m.4481
Contig1755|m.4465
Contig1755|m.4465",header=T)

# same as before
same.as.next <- sapply(1:length(dat.multduplicates[,1]),
            function(x)identical(dat.multduplicates[x,],dat.multduplicates[(x+1),]))+1
dat.multduplicates <- data.frame(dat.multduplicates,V5 = same.as.next)

# blank the row after every row that matches its successor;
# this also covers IDs repeated more than twice in a row
dat.multduplicates[which(dat.multduplicates$V5==2)+1,]$V5 <- NA
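
Note that the code above flags the start of every run with 2, however long the run is. If the actual length of each consecutive run is wanted instead (this is an assumption about the goal, not something the question states), a sketch along these lines with `rle` might work. The short IDs and the names `v4`, `runs`, `starts` and `v5` are hypothetical.

```r
# Hypothetical short IDs; "C" appears in two separate runs
v4 <- c("A", "B", "C", "C", "C", "D", "D", "E", "C", "C")

runs <- rle(v4)   # lengths and values of consecutive runs

# Index of the first row of each run
starts <- cumsum(runs$lengths) - runs$lengths + 1L

# NA everywhere except the first row of each run, which gets the run length
v5 <- rep(NA_integer_, length(v4))
v5[starts] <- runs$lengths
v5
# 1 1 3 NA NA 2 NA 1 2 NA
```

Each run is handled separately here, so an ID that reappears later in the column starts a fresh count.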

Upvotes: 1

user20650

Reputation: 25854

Not sure if this does what you want, but it produces your final table:

df <- read.table(text="Contig1401|m.3412
Contig1428|m.3512
Contig1755|m.4465
Contig1755|m.4465
Contig1897|m.4878
Contig1897|m.4878
Contig1757|m.4476
Contig1598|m.4011
Contig1759|m.4481
Contig1685|m.4244",header=F,  stringsAsFactors=FALSE)

# One way
df$id <- duplicated(df$V1 , fromLast=T) + 1 
df$id[duplicated(df$V1) ] <- NA

#or
df$id <- rep(rle(df$V1)$lengths,rle(df$V1)$lengths)
df$id[duplicated(df$V1) ] <- NA
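
To see what the two approaches do before the `NA` step, it may help to look at the intermediate values on a toy vector. This is just an illustration; `ids`, `dup_fwd`, `dup_back`, `r` and `run_len` are made-up names.

```r
ids <- c("a", "b", "b", "c")   # hypothetical toy IDs

dup_fwd  <- duplicated(ids)                   # marks repeats of an earlier value
dup_back <- duplicated(ids, fromLast = TRUE)  # marks values that recur later
dup_fwd    # FALSE FALSE  TRUE FALSE
dup_back   # FALSE  TRUE FALSE FALSE

r <- rle(ids)                          # consecutive runs: lengths 1, 2, 1
run_len <- rep(r$lengths, r$lengths)   # run length repeated for every row of the run
run_len    # 1 2 2 1
```

So the first approach gives 2 to any ID that occurs again later, regardless of how many times, while the `rle` approach gives the length of the consecutive run.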

Upvotes: 3
