AwaitedOne
AwaitedOne

Reputation: 1012

New column based on identical rows of group

I have some data as shown below:

Sample_Name      RP   Outlier rs10033147 rs1019916 rs1040870 rs10457834 rs10796216 rs10882854
31  2011          25 -5.408103         AB        AA        AA         AB         AB         AB
33  10145         25 -5.205900         AB        BB        BB         AB         BB         AB
5   2300       10647 -5.361135         AA        AA        AA         AA         AB         AA
21  20110      10647 -5.043994         AA        AB        AA         BB         AB         BB
24  2013       10647 -5.480397         AA        AB        AA         BB         AB         BB
28  200        10647 -4.635197         AA        AB        AA         BB         AB         BB
2   20110      11458 -4.935565         BB        AA        BB         AA         AB         BB
9   2311       11458 -4.913464         BB        AA        BB         AA         AB         BB
32  200901     11458 -4.721801         BB        AA        BB         AA         AB         BB

I want to group them using RP column and check if row starting from 4th column are identical. I want to give different values (0/1) for different rows. If there are only two rows in a group and these two rows are different, give different values. If there are more than two rows, give identical rows same value and others a different value. All different rows in a group will have same value.

I tried group_by from dplyr but not sure how to proceed from there. The desired output is shown.

Output:

Sample_Name     RP   Outlier rs10033147 rs1019916 rs1040870 rs10457834 rs10796216 rs10882854   ID
31  2011          25 -5.408103         AB        AA        AA         AB         AB         AB   0
33  10145         25 -5.205900         AB        BB        BB         AB         BB         AB   1 
5   2300       10647 -5.361135         AA        AA        AA         AA         AB         AA   0
21  20110      10647 -5.043994         AA        AB        AA         BB         AB         BB   1
24  2013       10647 -5.480397         AA        AB        AA         BB         AB         BB   1
28  200        10647 -4.635197         AA        AB        AA         BB         AB         BB   1
2   20110      11458 -4.935565         BB        AA        BB         AA         AB         BB   1
9   2311       11458 -4.913464         BB        AA        BB         AA         AB         BB   1
32  200901     11458 -4.721801         BB        AA        BB         AA         AB         BB   1

Upvotes: 1

Views: 85

Answers (3)

Monk
Monk

Reputation: 407

This code below works. It creates two 'rank over partition's and subtracts one from the other. One note, I believe according to your logic above, the last grouping (11458) should all be zeros not one's because you start new ID groups at zero. The code below uses that updated logic.

# Create dataframe
df = read.table(text = '
Sample_Name      RP   Outlier rs10033147 rs1019916 rs1040870 rs10457834 rs10796216 rs10882854
31  2011   25 -5.408103  AB        AA        AA  AB  AB  AB
33  10145  25 -5.205900  AB        BB        BB  AB  BB  AB
5   2300       10647 -5.361135  AA        AA        AA  AA  AB  AA
21  20110      10647 -5.043994  AA        AB        AA  BB  AB  BB
24  2013       10647 -5.480397  AA        AB        AA  BB  AB  BB
28  200        10647 -4.635197  AA        AB        AA  BB  AB  BB
2   20110      11458 -4.935565  BB        AA        BB  AA  AB  BB
9   2311       11458 -4.913464  BB        AA        BB  AA  AB  BB
32  200901     11458 -4.721801  BB        AA        BB  AA  AB  BB
', header = T)

# Create rank by RP partition
df <- ddply(df, .(RP), transform, RP_rank= rank(RP, ties.method = "first"))

# Create rank by RP & rs partition
df$rskey <- paste0(df$rs10033147,df$rs1019916,df$rs1040870,df$rs10457834,df$rs10796216,df$rs10882854)
df <- ddply(df, .(RP, rskey), transform, RPrs_rank = rank(RP, rskey, ties.method = "first"))

# This is the key step.  Subtract one partition rank from the other.
df$ID <- df$RP_rank - df$RPrs_rank

# Remove unneeded columns
df$RP_rank <- NULL; df$rskey <- NULL; df$RPrs_rank <- NULL

Upvotes: 0

Lisardo Erman
Lisardo Erman

Reputation: 148

A solution with dplyr (only works if you do not have more than two unique rows per group):

library(tidyverse)

df %>% group_by(RP, rs10033147, rs1019916, rs1040870, rs10457834, rs10796216, rs10882854) %>% mutate(ID = ifelse(n() > 1, 1, 0)) %>% ungroup %>% group_by(RP) %>% mutate(ID = ifelse(n() == 2 & row_number() == 2, 1, ID)) %>% ungroup()

#Sample    RP   Outlier rs10033147 rs1019916 rs1040870 rs10457834 rs10796216 rs10882854 ID
#  2011    25 -5.408103         AB        AA        AA         AB         AB         AB  0
# 10145    25 -5.205900         AB        BB        BB         AB         BB         AB  1
#  2300 10647 -5.361135         AA        AA        AA         AA         AB         AA  0
# 20110 10647 -5.043994         AA        AB        AA         BB         AB         BB  1
#  2013 10647 -5.480397         AA        AB        AA         BB         AB         BB  1
#   200 10647 -4.635197         AA        AB        AA         BB         AB         BB  1
# 20110 11458 -4.935565         BB        AA        BB         AA         AB         BB  1
#  2311 11458 -4.913464         BB        AA        BB         AA         AB         BB  1
#200901 11458 -4.721801         BB        AA        BB         AA         AB         BB  1

Upvotes: 0

M--
M--

Reputation: 29238

library(data.table)

setDT(df1)[, temp := Reduce(function(...) paste(..., sep = "-"), 
                            .SD[, mget(names(df1)[startsWith(names(df1), "rs")])])][, 
                            ID := sprintf('%01d', rleid(temp)), by = RP][, temp := NULL][]

#Sample    RP   Outlier rs10033147 rs1019916 rs1040870 rs10457834 rs10796216 rs10882854 ID
#  2011    25 -5.408103         AB        AA        AA         AB         AB         AB  1
# 10145    25 -5.205900         AB        BB        BB         AB         BB         AB  2
#  2300 10647 -5.361135         AA        AA        AA         AA         AB         AA  1
# 20110 10647 -5.043994         AA        AB        AA         BB         AB         BB  2
#  2013 10647 -5.480397         AA        AB        AA         BB         AB         BB  2
#   200 10647 -4.635197         AA        AB        AA         BB         AB         BB  2
# 20110 11458 -4.935565         BB        AA        BB         AA         AB         BB  1
#  2311 11458 -4.913464         BB        AA        BB         AA         AB         BB  1
#200901 11458 -4.721801         BB        AA        BB         AA         AB         BB  1

Upvotes: 1

Related Questions