Reputation: 2189
I have a column in a data frame that contains IDs, some of which are duplicated. Starting from the first ID, I want to check whether the same ID appears in the next row of the column. If it does, I do something; if not, I move on to the next ID and repeat. Here is the column in the df:
V4
Contig1401|m.3412
Contig1428|m.3512
Contig1755|m.4465
Contig1755|m.4465
Contig1897|m.4878
Contig1897|m.4878
Contig1757|m.4476
Contig1598|m.4011
Contig1759|m.4481
Contig1685|m.4244
As you can see, some IDs are duplicated and some are not. How do I go about this? So far I have written this:
first_id <- "Contig1401|m.3412"
for (i in data$V4) {
  if (i == first_id) {  # compare with ==, not =
    # do something
  } else {
    # do something else
  }
}
But I don't understand how to proceed from here. Basically, I want to obtain this:
V4 V5
Contig1401|m.3412 1
Contig1428|m.3512 1
Contig1755|m.4465 2
Contig1755|m.4465
Contig1897|m.4878 2
Contig1897|m.4878
Contig1757|m.4476 1
Contig1598|m.4011 1
Contig1759|m.4481 1
Contig1685|m.4244 1
Any ideas on how I can do this?
Thanks, Upendra
Upvotes: 1
Views: 82
Reputation: 17611
Here is an idea:
dat <- read.table(text="V4
Contig1401|m.3412
Contig1428|m.3512
Contig1755|m.4465
Contig1755|m.4465
Contig1897|m.4878
Contig1897|m.4878
Contig1757|m.4476
Contig1598|m.4011
Contig1759|m.4481",header=TRUE)
# Find entries where the next entry is the same
# Convert TRUE values to be 2 and FALSE values to be 1 by adding +1
same.as.next <- sapply(seq_len(nrow(dat)),
                       function(x) identical(dat$V4[x], dat$V4[x + 1])) + 1
dat <- data.frame(dat,V5 = same.as.next)
dat[duplicated(dat$V4),]$V5 <- NA
# notice Contig1755|m.4465 shows up 5 times in this example
dat.multduplicates <- read.table(text="V4
Contig1401|m.3412
Contig1428|m.3512
Contig1755|m.4465
Contig1755|m.4465
Contig1755|m.4465
Contig1897|m.4878
Contig1897|m.4878
Contig1757|m.4476
Contig1598|m.4011
Contig1759|m.4481
Contig1755|m.4465
Contig1755|m.4465",header=TRUE)
# same as before
same.as.next <- sapply(seq_len(nrow(dat.multduplicates)),
                       function(x) identical(dat.multduplicates$V4[x], dat.multduplicates$V4[x + 1])) + 1
dat.multduplicates <- data.frame(dat.multduplicates,V5 = same.as.next)
# set V5 to NA on every row that follows a row flagged as 2;
# this also handles IDs that repeat more than twice in a row
dat.multduplicates[which(dat.multduplicates$V5==2)+1,]$V5 <- NA
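The row-by-row identical() comparison above could also be written as a single vectorized comparison of the column against itself shifted by one. This is only a minimal sketch, assuming the IDs are held as a character vector; the names ids, same_as_next and res are illustrative and not part of the code above.

```r
# Illustrative data: the same IDs as dat.multduplicates, as a character vector
ids <- c("Contig1401|m.3412", "Contig1428|m.3512", "Contig1755|m.4465",
         "Contig1755|m.4465", "Contig1755|m.4465", "Contig1897|m.4878",
         "Contig1897|m.4878", "Contig1757|m.4476", "Contig1598|m.4011",
         "Contig1759|m.4481", "Contig1755|m.4465", "Contig1755|m.4465")

# Compare each element with the one after it; the last element has no
# successor, so it is marked FALSE
same_as_next <- c(ids[-length(ids)] == ids[-1], FALSE)

# TRUE becomes 2, FALSE becomes 1
V5 <- same_as_next + 1

# Set NA on every row that follows a row flagged as 2
V5[which(V5 == 2) + 1] <- NA

res <- data.frame(V4 = ids, V5 = V5)
```

As with the sapply() version, a run of three identical IDs still gets a 2 on its first row rather than a 3, because only adjacent pairs are compared.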
Upvotes: 1
Reputation: 25854
Not sure if this does what you want, but it produces your final table:
df <- read.table(text="Contig1401|m.3412
Contig1428|m.3512
Contig1755|m.4465
Contig1755|m.4465
Contig1897|m.4878
Contig1897|m.4878
Contig1757|m.4476
Contig1598|m.4011
Contig1759|m.4481
Contig1685|m.4244",header=FALSE, stringsAsFactors=FALSE)
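# One way: 2 if the same ID occurs again later in the column, otherwise 1
df$id <- duplicated(df$V1, fromLast=TRUE) + 1
# blank out the repeated occurrences
df$id[duplicated(df$V1)] <- NA
# Or: use the length of each run of consecutive identical IDs
df$id <- rep(rle(df$V1)$lengths, rle(df$V1)$lengths)
df$id[duplicated(df$V1)] <- NA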
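The two ways above give the same result on the sample data, but they can differ on other input: duplicated() looks at the whole column, while rle() only counts consecutive repeats. A small sketch on a made-up vector (x, way1 and way2 are hypothetical names, not from the question) shows the difference:

```r
# Made-up input: "b" repeats three times in a row, "a" repeats non-adjacently
x <- c("a", "b", "b", "b", "c", "a")

# Way 1: 2 if the value occurs again anywhere later in the vector, else 1
way1 <- duplicated(x, fromLast = TRUE) + 1
way1[duplicated(x)] <- NA
# way1 is: 2 2 NA NA 1 NA

# Way 2: length of the consecutive run each element belongs to
runs <- rle(x)
way2 <- rep(runs$lengths, runs$lengths)
way2[duplicated(x)] <- NA
# way2 is: 1 3 NA NA 1 NA
```

Which one fits depends on whether the count should reflect only adjacent rows, as in the question, or any repeat in the column.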
Upvotes: 3