Randy

Reputation: 63

Comparing Duplicate Samples

I have a data frame with 1200 individual cases, each run in duplicate, for a total of 2400 rows. The sample IDs are in one column, e.g. A1.1234567_10 and A1.1234567_20. There are multiple other columns, all factors, and for each duplicate pair I would like to check whether the two results in each column are the same or discrepant. How can I get a logical value per column for each pair? I want to match each case by its ID (e.g. A1.1234567) across the _10 and _20 rows:

EXAMPLE (one duplicate pair from the data frame)

A1.1234567_10 NORMAL NORMAL NORMAL NORMAL NORMAL NORMAL NORMAL NORMAL

A1.1234567_20 NORMAL NORMAL NORMAL NORMAL NORMAL NORMAL ABNORMAL NORMAL 

I'd like the output to look like this (new data frame)

A1.1234567 TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE

This would repeat for every sample down the column, comparing the _10 and _20 rows for each unique ID.

Upvotes: 1

Views: 160

Answers (2)

acylam

Reputation: 18691

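Another approach with tidyverse (credit to @alistaire for the dput data):

library(tidyverse)  # attaches dplyr and stringr

df %>%
  # strip the "_10" / "_20" suffix so both duplicates share one ID
  group_by(ID = str_extract(ID, ".+(?=_)")) %>%
  # TRUE when the pair agrees; n_distinct() also works for factor columns,
  # where table() would count unused levels
  summarize_all(~ n_distinct(.) == 1)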

Result:

# A tibble: 1 x 9
          ID  var1  var2  var3  var4  var5  var6  var7  var8
       <chr> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
1 A1.1234567  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE

Upvotes: 0

alistaire

Reputation: 43354

Here's a tidyverse option:

library(tidyverse)

df <- structure(list(ID = c("A1.1234567_10", "A1.1234567_20"), 
                     var1 = c("NORMAL", "NORMAL"), 
                     var2 = c("NORMAL", "NORMAL"), 
                     var3 = c("NORMAL", "NORMAL"), 
                     var4 = c("NORMAL", "NORMAL"), 
                     var5 = c("NORMAL", "NORMAL"), 
                     var6 = c("NORMAL", "NORMAL"), 
                     var7 = c("NORMAL", "ABNORMAL"), 
                     var8 = c("NORMAL", "NORMAL")), 
                .Names = c("ID", "var1", "var2", "var3", "var4", "var5", "var6", "var7", "var8"), 
                class = "data.frame", row.names = c(NA, -2L))

# separate group variable from observation label
df_tidy <- df %>% separate(ID, c('ID', 'obs'), sep = '_')

df_tidy
#>           ID obs   var1   var2   var3   var4   var5   var6     var7   var8
#> 1 A1.1234567  10 NORMAL NORMAL NORMAL NORMAL NORMAL NORMAL   NORMAL NORMAL
#> 2 A1.1234567  20 NORMAL NORMAL NORMAL NORMAL NORMAL NORMAL ABNORMAL NORMAL

df_tidy %>% 
    select(-obs) %>% 
    group_by(ID) %>% 
    summarise_all(lift(`==`))
#> # A tibble: 1 x 9
#>           ID  var1  var2  var3  var4  var5  var6  var7  var8
#>        <chr> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
#> 1 A1.1234567  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
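
As far as I know, purrr::lift() was deprecated in purrr 1.0.0, and lifting == relies on there being exactly two rows per ID. A minimal sketch of the same comparison with across(), assuming dplyr 1.0 or later; the toy data below is a cut-down version of the example with factor columns as described in the question, not the asker's real data:

```r
library(dplyr)

# Hypothetical two-column version of the example, with factor columns
df_fct <- data.frame(
  ID   = c("A1.1234567_10", "A1.1234567_20"),
  var1 = factor(c("NORMAL", "NORMAL")),
  var7 = factor(c("NORMAL", "ABNORMAL")),
  stringsAsFactors = FALSE
)

result <- df_fct %>%
  # drop the trailing "_10" / "_20" so both duplicates share one ID
  mutate(ID = sub("_[^_]*$", "", ID)) %>%
  group_by(ID) %>%
  # TRUE when every row in the group holds the same value
  summarise(across(everything(), ~ n_distinct(.x) == 1), .groups = "drop")

result
```

Here result has one row per ID, with var1 TRUE and var7 FALSE. Counting distinct values rather than comparing two elements also means an ID with a missing or extra replicate does not raise an error, although whether a single unpaired row should count as TRUE is a judgment call.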

Upvotes: 3
