allisonrs
allisonrs

Reputation: 102

R: Replace values in one dataframe with value from another dataframe if conditions are met, otherwise append skipped data

From my instrumentation, I receive two different .tsv files containing my data. The first file contains, among other things, the name of the sample, its position in a 12x8 grid, and its output data. The second file contains average data from replicate sets based off the first text file. I've re-created an example of the two files in these data frames -- I actually read them using the read.table() function.

#re-creation of first .tsv file
Data <- data.frame(Name = c("100", "100", "200", "250", "1E5", "1E5", "Negative", "Negative"),
                   Pos = c("A3", "A4", "B3", "B4", "C3", "C4", "D3", "D4"),
                   Output = c("20.00", "20.10", "21.67", "23.24", "21.97", "22.03", "38.99", "38.99"))

Data
      Name Pos Output
1      100  A3  20.00
2      100  A4  20.10
3      200  B3  21.67
4      250  B4  23.24
5      1E5  C3  21.97
6      1E5  C4  22.03
7 Negative  D3  38.99
8 Negative  D4  38.99

#re-creation of second .tsv file
Replicates <- data.frame(Replicates = c("A3, A4", "C3, C4", "D3, D4"),
                         Mean.Cq = c(20.05, 22.00, 38.99)
                         STD.Cq = c(0.05, 0.03, 0.00))

Replicates

  Replicates Mean.Cq STD.Cq
1     A3, A4 20.05   0.05
2     C3, C4 22.00   0.03
3     D3, D4 38.99   0.00

This is what I'm trying to create:

#Rename values in Replicates$Name with value in Data$Name if replicate is present; append with non-replicate data 

          Name  Mean.Cq STD.Cq
1          100  20.05   0.05
2          1E5  22.00   NA
3     Negative  38.99   NA
4          200  21.67   0.03
5          250  23.24   0.00

I can do this manually by creating a dataframe using stringr and rbind.fill from slices of the "Data" dataframe such that I keep the first instance of each name of the replicates, then remove the $Replicates column from the "Replicates" dataframe and replace it with the $Name column of the newly-created sliced dataframe. I can then append the rows of non-replicate samples to the "Replicates" dataframe. However, not all of my files have the exact same pattern of replicates, or number of samples.

I have been trying in vein to mimic this example such that I can do this process for each file set regardless of the order or number of replicates, instead of going through each one and cleaning by hand. How do I transform this manual process into a for loop to keep from having to make a bunch of sliced dataframes?

The first part of my problem I think has been the ability to detect only part of the Replicates$Replicates pattern in Data$Name , not just the individual characters For example detect either A3 OR A4 from Replicates$Replicates[1] in Data$Name, then replace the value of Replicates$Replicates with the value of the first match found in Data$Name. I'm stuck at this step.

> str_replace(Replicates$Replicates, (str_detect(Data$Name, "[Replicates$Replicates]")))
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Any insight would be super helpful! Sill new to programming, bioinformatics, and data science and I'm trying to figure it out as I go on my data.

EDIT

Thank you, @Skaqqs for helping to answer the question. I made edits from his answer to fit this in the tidyverse, which I have been finding a bit easier to adapt to than base R. Splitting the replicates into two columns, then sorting and joining did the trick (and that's where I was getting stuck).

require(tidyverse)

Samples <- tibble(Name = c("100", "100", "200", "250", "1E5", "1E5", 
                            "Negative", "Negative"),
                  Pos = c("A3", "A4", "B3", "B4", "C3", "C4", "D3", "D4"),
                  Output = c("20.00", "20.10", "21.67", "23.24", "21.97", 
                           "22.03", "38.99", "38.99")) 

Replicates <- tibble(Replicates = c("A3, A4", "C3, C4", "D3, D4"),
                     Mean.Cq = c(20.05, 22.00, 38.99),
                     STD.Cq = c(0.05, 0.03, 0.00))

Samples %>%
        .[str_order(.$Pos, numeric = TRUE),]    

Replicates %>%  
        mutate("R1" = gsub(x = Replicates, pattern = "^(.*),.*", replacement = "\\1")) %>%
        mutate("R2" = gsub(x = Replicates, pattern = ".*,\\s(.*)", replacement = "\\1")) %>%
        pivot_longer(cols = c("R1", "R2"), names_to ="Well Pairs", values_to = "Wells") %>% 
        select("Mean.Cq", "STD.Cq", "Wells") %>% 
        relocate("Wells", 1) %>%
        right_join(Samples, by = c("Wells"="Pos")) %>%
        .[str_order(.$Wells, numeric = TRUE),] %>% 
        select("Name", "Mean.Cq", "STD.Cq") %>% 
        distinct(Name, .keep_all = TRUE)    
        
    # A tibble: 5 x 3
      Name     Mean.Cq STD.Cq
      <chr>      <dbl>  <dbl>
    1 100         20.0   0.05
    2 200         NA    NA   
    3 250         NA    NA   
    4 1E5         22     0.03
    5 Negative    39.0   0 

Upvotes: 2

Views: 1338

Answers (1)

Skaqqs
Skaqqs

Reputation: 4140

This sounds like a join/merge question to me. My suggestion is to split Replicates$Replicates into two fields and essentially treat their data separately too. Then after joining your two Replicates tables with Data, use unique() to drop duplicates in your summary table.

library(dplyr)

# Split `Replicates$Replicates` into two fields
# This assumes your `Replicates` field is has two values, seperated by a comma and whitespace 
Replicates$R1 <- gsub(x = Replicates$Replicates, pattern = "^(.*),.*", replacement = "\\1")
Replicates$R2 <- gsub(x = Replicates$Replicates, pattern = ".*,\\s(.*)", replacement = "\\1")

# Inner-join `Data` and `Replicates` by `R1` and `R2`
df <- merge(Data, Replicates,  by.x = "Pos", by.y = "R1", all.x = FALSE)
df2 <- merge(Data, Replicates, by.x = "Pos", by.y = "R2", all.x = FALSE)
df3 <- dplyr::bind_rows(df, df2)

unique(df3[,c("Name", "Mean.Cq", "STD.Cq")])
#>       Name Mean.Cq STD.Cq
#> 1      100   20.05   0.05
#> 2      1E5   22.00   0.03
#> 3 Negative   38.99   0.00

Upvotes: 2

Related Questions