Reputation: 102
From my instrumentation, I receive two different .tsv files containing my data. The first file contains, among other things, the name of the sample, its position in a 12x8 grid, and its output data. The second file contains average data from replicate sets based off the first text file. I've re-created an example of the two files in these data frames -- I actually read them using the read.table()
function.
#re-creation of first .tsv file
Data <- data.frame(Name = c("100", "100", "200", "250", "1E5", "1E5", "Negative", "Negative"),
Pos = c("A3", "A4", "B3", "B4", "C3", "C4", "D3", "D4"),
Output = c("20.00", "20.10", "21.67", "23.24", "21.97", "22.03", "38.99", "38.99"))
Data
Name Pos Output
1 100 A3 20.00
2 100 A4 20.10
3 200 B3 21.67
4 250 B4 23.24
5 1E5 C3 21.97
6 1E5 C4 22.03
7 Negative D3 38.99
8 Negative D4 38.99
#re-creation of second .tsv file
Replicates <- data.frame(Replicates = c("A3, A4", "C3, C4", "D3, D4"),
Mean.Cq = c(20.05, 22.00, 38.99)
STD.Cq = c(0.05, 0.03, 0.00))
Replicates
Replicates Mean.Cq STD.Cq
1 A3, A4 20.05 0.05
2 C3, C4 22.00 0.03
3 D3, D4 38.99 0.00
This is what I'm trying to create:
#Rename values in Replicates$Name with value in Data$Name if replicate is present; append with non-replicate data
Name Mean.Cq STD.Cq
1 100 20.05 0.05
2 1E5 22.00 NA
3 Negative 38.99 NA
4 200 21.67 0.03
5 250 23.24 0.00
I can do this manually by creating a dataframe using stringr
and rbind.fill
from slices of the "Data" dataframe such that I keep the first instance of each name of the replicates, then remove the $Replicates
column from the "Replicates" dataframe and replace it with the $Name
column of the newly-created sliced dataframe. I can then append the rows of non-replicate samples to the "Replicates" dataframe. However, not all of my files have the exact same pattern of replicates, or number of samples.
I have been trying in vein to mimic this example such that I can do this process for each file set regardless of the order or number of replicates, instead of going through each one and cleaning by hand. How do I transform this manual process into a for loop to keep from having to make a bunch of sliced dataframes?
The first part of my problem I think has been the ability to detect only part of the Replicates$Replicates
pattern in Data$Name
, not just the individual characters
For example detect either A3 OR A4 from Replicates$Replicates[1]
in Data$Name
, then replace the value of Replicates$Replicates
with the value of the first match found in Data$Name
. I'm stuck at this step.
> str_replace(Replicates$Replicates, (str_detect(Data$Name, "[Replicates$Replicates]")))
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Any insight would be super helpful! Sill new to programming, bioinformatics, and data science and I'm trying to figure it out as I go on my data.
EDIT
Thank you, @Skaqqs for helping to answer the question. I made edits from his answer to fit this in the tidyverse, which I have been finding a bit easier to adapt to than base R. Splitting the replicates into two columns, then sorting and joining did the trick (and that's where I was getting stuck).
require(tidyverse)
Samples <- tibble(Name = c("100", "100", "200", "250", "1E5", "1E5",
"Negative", "Negative"),
Pos = c("A3", "A4", "B3", "B4", "C3", "C4", "D3", "D4"),
Output = c("20.00", "20.10", "21.67", "23.24", "21.97",
"22.03", "38.99", "38.99"))
Replicates <- tibble(Replicates = c("A3, A4", "C3, C4", "D3, D4"),
Mean.Cq = c(20.05, 22.00, 38.99),
STD.Cq = c(0.05, 0.03, 0.00))
Samples %>%
.[str_order(.$Pos, numeric = TRUE),]
Replicates %>%
mutate("R1" = gsub(x = Replicates, pattern = "^(.*),.*", replacement = "\\1")) %>%
mutate("R2" = gsub(x = Replicates, pattern = ".*,\\s(.*)", replacement = "\\1")) %>%
pivot_longer(cols = c("R1", "R2"), names_to ="Well Pairs", values_to = "Wells") %>%
select("Mean.Cq", "STD.Cq", "Wells") %>%
relocate("Wells", 1) %>%
right_join(Samples, by = c("Wells"="Pos")) %>%
.[str_order(.$Wells, numeric = TRUE),] %>%
select("Name", "Mean.Cq", "STD.Cq") %>%
distinct(Name, .keep_all = TRUE)
# A tibble: 5 x 3
Name Mean.Cq STD.Cq
<chr> <dbl> <dbl>
1 100 20.0 0.05
2 200 NA NA
3 250 NA NA
4 1E5 22 0.03
5 Negative 39.0 0
Upvotes: 2
Views: 1338
Reputation: 4140
This sounds like a join/merge question to me. My suggestion is to split Replicates$Replicates
into two fields and essentially treat their data separately too. Then after joining your two Replicates
tables with Data
, use unique()
to drop duplicates in your summary table.
library(dplyr)
# Split `Replicates$Replicates` into two fields
# This assumes your `Replicates` field is has two values, seperated by a comma and whitespace
Replicates$R1 <- gsub(x = Replicates$Replicates, pattern = "^(.*),.*", replacement = "\\1")
Replicates$R2 <- gsub(x = Replicates$Replicates, pattern = ".*,\\s(.*)", replacement = "\\1")
# Inner-join `Data` and `Replicates` by `R1` and `R2`
df <- merge(Data, Replicates, by.x = "Pos", by.y = "R1", all.x = FALSE)
df2 <- merge(Data, Replicates, by.x = "Pos", by.y = "R2", all.x = FALSE)
df3 <- dplyr::bind_rows(df, df2)
unique(df3[,c("Name", "Mean.Cq", "STD.Cq")])
#> Name Mean.Cq STD.Cq
#> 1 100 20.05 0.05
#> 2 1E5 22.00 0.03
#> 3 Negative 38.99 0.00
Upvotes: 2