Reputation: 411
I tried to solve this problem in Perl, but it only works with small data sets, so I need a solution in R, which I guess is faster and easier than Perl anyway. I have one file like this, with two positions in the genome (first and second columns) and the distance between them (third column):
cg00000029 cg01016459 848
cg00000029 cg02021817 38
cg00000029 cg02851944 13
cg00000029 cg02976952 238
cg00000029 cg03943270 93
cg00000029 cg07396495 604
cg00000029 cg12190057 929
My second file is this one, with the position in the genome and one expression value per column, one column for each sample (1 to 6):
TargetID sample1 sample2 sample3 sample4 sample5 sample6
cg00000029 0.157 0.444 0.466 0.805 0.5489 0.448
cg01016459 0.873 0.930 0.926 0.942 0.932 0.9128
cg03943270 0.871 0.920 0.926 0.942 0.942 0.942
In fact I have 100 samples. My idea is to get a final file for each sample with the expression values in place of the cg IDs, plus the distance. For example, for sample 1:
0.157 0.873 848
0.157 0.871 93
for sample 2
0.444 0.930 848
0.444 0.920 93
In Perl I have no problems when I have only two samples: I load the files into two structures (hashes of arrays) and then compare them using nested foreach loops. But it takes a long time even for 2 samples; imagine 100! I tried in R, loading the data into two data frames and using something like
expression[rownames(expression) %in% rownames(distances),]
The problem is that I need something like a loop or an apply function to iterate over the expression data using the first CpG value and then the second; if both members of a pair are present in expression, output their expression values and the distance.
Any ideas would be welcome
Thanks in advance
Upvotes: 0
Views: 991
Reputation: 32351
You can join the two data.frames with merge, convert the result to the tall format with melt, and then apply a function (e.g., to save to a file) on each piece of the result with d_ply.
# Sample data
n <- length(LETTERS)
d1 <- cbind( expand.grid( LETTERS, LETTERS ), rnorm( n*n ) )
names(d1) <- c("id1", "id2", "distance")
d1 <- d1[ as.character(d1$id1) < as.character(d1$id2), ]
d2 <- as.data.frame( matrix( rnorm(n*6), nrow=n ) )
d2 <- data.frame( id=LETTERS, d2 )
names( d2 )[-1] <- paste( "sample", 1:6, sep="")
# If the distance data.frame only contains half the pairs,
# i.e., if it only contains one of (a,b) and (b,a),
# add the missing ones.
d1a <- d1
d1b <- d1[,c(2,1,3)]
names(d1b) <- names(d1a)
d1 <- rbind( d1a, d1b )
d1 <- d1[ ! duplicated( d1[,1:2]), ]
# Merge the two data.frames
d <- merge( d1, d2, by.x="id1", by.y="id" )
# Convert to tall format
library(reshape2)  # melt
library(plyr)      # d_ply
d <- melt(d, id.vars=c("id1", "id2", "distance"))
# Apply a function to each chunk
d_ply( d, "variable", function (u) {
  cat( "Would save ", nrow(u), " rows to ", as.character(u$variable[1]), "\n" )
} )
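Note that the merge above only attaches the expression values for id1. If you also need the values for the second ID, as in the desired output from the question, one option is to merge twice, once per ID column. Here is a minimal base-R sketch of that idea using the question's data; the column names V1, V2, V3 and the helper pairs_for_sample are my own assumptions, not anything from the question:

```r
# Data from the question (column names V1, V2, V3 are assumed here)
distances <- data.frame(
  V1 = rep("cg00000029", 7),
  V2 = c("cg01016459", "cg02021817", "cg02851944", "cg02976952",
         "cg03943270", "cg07396495", "cg12190057"),
  V3 = c(848L, 38L, 13L, 238L, 93L, 604L, 929L),
  stringsAsFactors = FALSE
)
expression <- data.frame(
  TargetID = c("cg00000029", "cg01016459", "cg03943270"),
  sample1  = c(0.157, 0.873, 0.871),
  sample2  = c(0.444, 0.930, 0.920),
  stringsAsFactors = FALSE
)

# Hypothetical helper: build the three-column table for one sample.
# merge() does an inner join by default, so pairs with a probe that is
# missing from the expression table are dropped.
pairs_for_sample <- function(distances, expression, sample) {
  lookup <- expression[, c("TargetID", sample)]
  # First merge: expression of the first probe (V1)
  m <- merge(distances, lookup, by.x = "V1", by.y = "TargetID")
  names(m)[names(m) == sample] <- "expr1"
  # Second merge: expression of the second probe (V2)
  m <- merge(m, lookup, by.x = "V2", by.y = "TargetID")
  names(m)[names(m) == sample] <- "expr2"
  m <- m[, c("expr1", "expr2", "V3")]
  names(m)[3] <- "distance"
  m
}

# One data.frame per sample, collected in a named list
samples <- setdiff(names(expression), "TargetID")
result <- lapply(samples, pairs_for_sample,
                 distances = distances, expression = expression)
names(result) <- samples
```

Each element of result could then be written out with write.table, one file per sample.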
Upvotes: 0
Reputation: 43255
If your first data set is in dat:
structure(list(V1 = c("cg00000029", "cg00000029", "cg00000029",
"cg00000029", "cg00000029", "cg00000029", "cg00000029"), V2 = c("cg01016459",
"cg02021817", "cg02851944", "cg02976952", "cg03943270", "cg07396495",
"cg12190057"), V3 = c(848L, 38L, 13L, 238L, 93L, 604L, 929L)), .Names = c("V1",
"V2", "V3"), class = "data.frame", row.names = c(NA, -7L))
and the second set is in target:
structure(list(TargetID = c("cg00000029", "cg01016459", "cg03943270"
), sample1 = c(0.157, 0.873, 0.871), sample2 = c(0.444, 0.93,
0.92), sample3 = c(0.466, 0.926, 0.926), sample4 = c(0.805, 0.942,
0.942), sample5 = c(0.5489, 0.932, 0.942), sample6 = c(0.448,
0.9128, 0.942)), .Names = c("TargetID", "sample1", "sample2",
"sample3", "sample4", "sample5", "sample6"), class = "data.frame", row.names = c(NA,
-3L))
match() will get you what you're looking for. I would use the reshape and plyr packages, specifically melt and dlply, but I'm sure there is an apply version too.
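library(reshape2)  # melt (the older reshape package also works)
library(plyr)      # dlply
# One row per (TargetID, sample) combination
target.melt <- melt(target, id.vars='TargetID')
# For one sample's values, look up the expression of both probes in each pair
my.func <- function(lookup, df) {
  cg.one <- lookup$value[match(df$V1, lookup$TargetID)]
  cg.two <- lookup$value[match(df$V2, lookup$TargetID)]
  return(list(cgone=cg.one, cgtwo=cg.two, distance=df$V3))
}
# One list element per sample
out <- dlply(target.melt, .(variable), my.func, df=dat)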
There are a bunch of NAs with your data since the second data set is incomplete, but what you asked for is there:
> na.omit(as.data.frame(out[[1]]))
  cgone cgtwo distance
1 0.157 0.873      848
5 0.157 0.871       93
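As for the "apply version", here is a rough base-R sketch with no extra packages. It rebuilds dat and target (trimmed to two samples) so it runs on its own, calls match() once per ID column, and then reuses those indices for every sample, which should matter with 100 samples. The names i1, i2 and keep are just mine for illustration:

```r
# Same data as above, trimmed to two samples so the block is self-contained
dat <- data.frame(
  V1 = rep("cg00000029", 7),
  V2 = c("cg01016459", "cg02021817", "cg02851944", "cg02976952",
         "cg03943270", "cg07396495", "cg12190057"),
  V3 = c(848L, 38L, 13L, 238L, 93L, 604L, 929L),
  stringsAsFactors = FALSE
)
target <- data.frame(
  TargetID = c("cg00000029", "cg01016459", "cg03943270"),
  sample1  = c(0.157, 0.873, 0.871),
  sample2  = c(0.444, 0.930, 0.920),
  stringsAsFactors = FALSE
)

# Row positions in target of each pair's first and second probe (NA if absent)
i1 <- match(dat$V1, target$TargetID)
i2 <- match(dat$V2, target$TargetID)
# Keep only the pairs where both probes have expression values
keep <- !is.na(i1) & !is.na(i2)

# One data.frame per sample, reusing the same indices each time
samples <- setdiff(names(target), "TargetID")
out <- lapply(samples, function(s) {
  data.frame(cgone    = target[[s]][i1[keep]],
             cgtwo    = target[[s]][i2[keep]],
             distance = dat$V3[keep])
})
names(out) <- samples
```

Because the incomplete pairs are filtered out up front, there is no need for na.omit afterwards.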
Upvotes: 2