Reputation: 1120
That's how my data looks like:
> dput(data)
structure(list(Name = c("Mark", "Tere", "Marcus", "Heidi", "Georg",
"Tieme", "Joan", "Tytus", "Mark", "Tere", "Marcus", "Heidi",
"Georg", "Tieme", "Joan", "Tytus", "Mark", "Tere", "Marcus",
"Heidi", "Georg", "Tieme", "Joan", "Tytus", "Mark", "Tere", "Marcus",
"Heidi", "Georg", "Tieme", "Joan", "Tytus", "Mark", "Tere", "Marcus",
"Heidi", "Georg", "Tieme", "Joan", "Tytus", "Mark", "Tere", "Marcus",
"Heidi", "Georg", "Tieme", "Joan", "Tytus"), position = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("n",
"w"), class = "factor"), time = c(0, 0, 0, 0, 0, 0, 0, 0, 1,
1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0,
0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2), type = c(5,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3), value = structure(c(0.314543267094949, 3.19693140536703,
0.930679293700748, 0.344465436879545, 1.12925174828143, 0.295226529240608,
0.534869402647018, 0.699742971466516, 0.899450918860775, 1.69467910557034,
0.105376912280917, 2.49469271475885, 0.0343013256788254, 0.469440043556817,
0.917351596015406, 1.46852763733807, 0.967872510622156, 0.354032471310347,
1.52519534333589, 1.04999953913746, 1.90298544048312, 0.265492846723646,
1.19947244036467, 2.54636084834549, 0.401142360642552, 0.190538682818794,
1.55882554492893, 1.27338900133492, 0.298795571085066, 1.73941845252159,
0.133721855003387, 0.115250693634152, 1.12742130434271, 1.50438847943189,
0.0286978152580559, 0.740798383999788, 0.838944700504553, 0.726393882538543,
1.79097043714466, 0.214124974329025, 1.46675111903789, 0.0172332774632657,
2.89433199895506, 3.07798819227104, 0.826817313167167, 0.72394085004025,
0.083957624156028, 1.07571752737732), .Dim = c(48L, 1L))), .Names = c("Name",
"position", "time", "type", "value"), row.names = c(NA, 48L), class = "data.frame")
I would like to perform t-test of value
in specific rows. How to find the rows for which t-test will be performed:
1. Name
has to be the same.
2. Type
has to be the same.
3. Time
= 0
is always a reference so t-test should be calculated between time
= 1
and time
= 0
, time
= 2
and time
= 0
, and so on. That's an example data so I can't tell you how many time
is in real data but 0 is always a reference.
4. Output (results of t-test) can be stored in additional column and in the row with specified time
.
5. Number of Names
and Times
I do not know. The data contains thousands of rows.
6. I believe that the loop is the only solution and anyway it is my desired code. Of course additional solution without the loop is more than welcome.
It's a bit of programming so I am going to start a bounty to reward someone who will put some effort on that.
Upvotes: 2
Views: 232
Reputation: 11514
You were asking explicitly for a solution involving a loop
. While there are other more appropriate ways of achieving this, let me show you an example with a loop here. More elegant solutions can be found here.
First, note that for some combinations, the number of observations will be too small to carry out a t-test. Let us therefore define a t-test function that will not disturb the continuation of the loop if this happens:
library(dplyr)
failproof.t <- failwith("t-test failed", t.test, quiet=T)
Now, let us pre-assign the relevant columns to the dataset.
df <- read.csv("Data_to_analyze.csv")
df$t_stat <- df$p_val <- NA
Now we can just loop through all combinations as follows:
for(protein in unique(df$protein)){
for(fract in unique(df$fract)){
for(tp in unique(df$tp)[-1]){ # -1 to exclude tp==0
out <- failproof.t(
df[df$protein==protein & df$fract==fract & df$tp==0, "abundance"],
df[df$protein==protein & df$fract==fract & df$tp==tp, "abundance"]
)
if(is.list(out)){
df[df$protein==protein & df$fract==fract & df$tp==tp, "t_stat"] <- out$statistic
df[df$protein==protein & df$fract==fract & df$tp==tp, "p_val"] <- out$p.value
}
}
}
}
This will give you the t-statistic and the p-values as a new column in the original dataset df
.
However, note a few things though:
There is, in fact, no reason to prefer this code over the above data.table
solution. Even though your superiors "strongly suggested" using a loop, you could educate them that there are better ways of achieving this.
Upvotes: 1
Reputation: 887571
We can try with data.table
library(data.table)
setDT(data, keep.rownames= TRUE)
dt1 <- data[time==0]
dt2 <- data[time!=0]
res <- dt2[dt1, on = c("Name", "type")][, .(r1= rn, r2 = i.rn, time0 = i.time,
Othertime = time, pval = t.test(value, i.value)$p.value), .(Name, type)]
res1 <- melt(res, measure = patterns("^r\\d+", "time"),
value.name= c("rn", "time"))[, variable := NULL][]
res2 <- unique(res1, by = c("Name", "type", "rn"))
res3 <- data[res2, on = "rn"][order(as.numeric(rn))][,
c('i.Name', 'i.type', 'i.time') := NULL][]
head(res3,3)
# rn Name position time type value pval
#1: 1 Mark n 0 5 0.3145433 0.03514213
#2: 2 Tere n 0 5 3.1969314 0.19052234
#3: 3 Marcus n 0 5 0.9306793 0.89741695
Using the OP's real data
file <- "C:/Users/akrun/Misc/Data_to_analyze.csv"
data1 <- fread(file)
dt1 <- data1[tp==0]
dt2 <- data1[tp!=0]
res <- dt2[dt1, on = c("protein", "fract"), allow.cartesian = TRUE
][, if(.N >1) {
.(r1= V1, r2 = i.V1, time0 = i.tp,Othertime = tp,
pval = t.test(abundance, i.abundance)$p.value)
} # else .(r1=V1, r2 = i.V1, time0 = i.tp, Othertime = tp, pval = NA_real_)
, by = .(protein, fract)]
res1 <- melt(res, measure = patterns("^r\\d+", "time"),
value.name= c("V1", "time"))[, variable := NULL][]
res2 <- unique(res1, by = c("protein", "fract", "V1"))
res3 <- data1[res2, on = "V1"][order(as.numeric(V1))][,
c('i.protein', 'i.fract') := NULL][]
head(res3)
# V1 protein line tp fract abundance pval time
#1: 1 PCP000002 n 0 1 46697305.3 0.515626208527 1
#2: 2 PCP000007 n 0 1 8384565.8 0.000643157099 1
#3: 3 PCP000008 n 0 1 570026.3 0.000006871077 1
#4: 4 PCP000012 n 0 1 1018257.3 0.085610886030 1
#5: 5 PCP000017 n 0 1 157877521.4 0.016528856828 1
#6: 6 PCP000018 n 0 1 426215586.5 0.046130079694 1
Upvotes: 5