Reputation: 1012
I have some data as shown below:
Sample_Name RP Outlier rs10033147 rs1019916 rs1040870 rs10457834 rs10796216 rs10882854
31 2011 25 -5.408103 AB AA AA AB AB AB
33 10145 25 -5.205900 AB BB BB AB BB AB
5 2300 10647 -5.361135 AA AA AA AA AB AA
21 20110 10647 -5.043994 AA AB AA BB AB BB
24 2013 10647 -5.480397 AA AB AA BB AB BB
28 200 10647 -4.635197 AA AB AA BB AB BB
2 20110 11458 -4.935565 BB AA BB AA AB BB
9 2311 11458 -4.913464 BB AA BB AA AB BB
32 200901 11458 -4.721801 BB AA BB AA AB BB
I want to group them using RP
column and check if row starting from 4th column are identical. I want to give different values (0/1) for different rows. If there are only two rows in a group and these two rows are different, give different values. If there are more than two rows, give identical rows same value and others a different value. All different rows in a group will have same value.
I tried group_by
from dplyr
but not sure how to proceed from there. The desired output is shown.
Output:
Sample_Name RP Outlier rs10033147 rs1019916 rs1040870 rs10457834 rs10796216 rs10882854 ID
31 2011 25 -5.408103 AB AA AA AB AB AB 0
33 10145 25 -5.205900 AB BB BB AB BB AB 1
5 2300 10647 -5.361135 AA AA AA AA AB AA 0
21 20110 10647 -5.043994 AA AB AA BB AB BB 1
24 2013 10647 -5.480397 AA AB AA BB AB BB 1
28 200 10647 -4.635197 AA AB AA BB AB BB 1
2 20110 11458 -4.935565 BB AA BB AA AB BB 1
9 2311 11458 -4.913464 BB AA BB AA AB BB 1
32 200901 11458 -4.721801 BB AA BB AA AB BB 1
Upvotes: 1
Views: 85
Reputation: 407
This code below works. It creates two 'rank over partition's and subtracts one from the other. One note, I believe according to your logic above, the last grouping (11458) should all be zeros not one's because you start new ID groups at zero. The code below uses that updated logic.
# Create dataframe
df = read.table(text = '
Sample_Name RP Outlier rs10033147 rs1019916 rs1040870 rs10457834 rs10796216 rs10882854
31 2011 25 -5.408103 AB AA AA AB AB AB
33 10145 25 -5.205900 AB BB BB AB BB AB
5 2300 10647 -5.361135 AA AA AA AA AB AA
21 20110 10647 -5.043994 AA AB AA BB AB BB
24 2013 10647 -5.480397 AA AB AA BB AB BB
28 200 10647 -4.635197 AA AB AA BB AB BB
2 20110 11458 -4.935565 BB AA BB AA AB BB
9 2311 11458 -4.913464 BB AA BB AA AB BB
32 200901 11458 -4.721801 BB AA BB AA AB BB
', header = T)
# Create rank by RP partition
df <- ddply(df, .(RP), transform, RP_rank= rank(RP, ties.method = "first"))
# Create rank by RP & rs partition
df$rskey <- paste0(df$rs10033147,df$rs1019916,df$rs1040870,df$rs10457834,df$rs10796216,df$rs10882854)
df <- ddply(df, .(RP, rskey), transform, RPrs_rank = rank(RP, rskey, ties.method = "first"))
# This is the key step. Subtract one partition rank from the other.
df$ID <- df$RP_rank - df$RPrs_rank
# Remove unneeded columns
df$RP_rank <- NULL; df$rskey <- NULL; df$RPrs_rank <- NULL
Upvotes: 0
Reputation: 148
A solution with dplyr
(only works if you do not have more than two unique rows per group):
library(tidyverse)
df %>% group_by(RP, rs10033147, rs1019916, rs1040870, rs10457834, rs10796216, rs10882854) %>% mutate(ID = ifelse(n() > 1, 1, 0)) %>% ungroup %>% group_by(RP) %>% mutate(ID = ifelse(n() == 2 & row_number() == 2, 1, ID)) %>% ungroup()
#Sample RP Outlier rs10033147 rs1019916 rs1040870 rs10457834 rs10796216 rs10882854 ID
# 2011 25 -5.408103 AB AA AA AB AB AB 0
# 10145 25 -5.205900 AB BB BB AB BB AB 1
# 2300 10647 -5.361135 AA AA AA AA AB AA 0
# 20110 10647 -5.043994 AA AB AA BB AB BB 1
# 2013 10647 -5.480397 AA AB AA BB AB BB 1
# 200 10647 -4.635197 AA AB AA BB AB BB 1
# 20110 11458 -4.935565 BB AA BB AA AB BB 1
# 2311 11458 -4.913464 BB AA BB AA AB BB 1
#200901 11458 -4.721801 BB AA BB AA AB BB 1
Upvotes: 0
Reputation: 29238
library(data.table)
setDT(df1)[, temp := Reduce(function(...) paste(..., sep = "-"),
.SD[, mget(names(df1)[startsWith(names(df1), "rs")])])][,
ID := sprintf('%01d', rleid(temp)), by = RP][, temp := NULL][]
#Sample RP Outlier rs10033147 rs1019916 rs1040870 rs10457834 rs10796216 rs10882854 ID
# 2011 25 -5.408103 AB AA AA AB AB AB 1
# 10145 25 -5.205900 AB BB BB AB BB AB 2
# 2300 10647 -5.361135 AA AA AA AA AB AA 1
# 20110 10647 -5.043994 AA AB AA BB AB BB 2
# 2013 10647 -5.480397 AA AB AA BB AB BB 2
# 200 10647 -4.635197 AA AB AA BB AB BB 2
# 20110 11458 -4.935565 BB AA BB AA AB BB 1
# 2311 11458 -4.913464 BB AA BB AA AB BB 1
#200901 11458 -4.721801 BB AA BB AA AB BB 1
Upvotes: 1