user976991

Reputation: 411

Iterating two big data frames in R. Comparing 2 different positions at the same time using conditions

I tried to solve this problem in Perl, but it only works with small data sets, so I need a solution in R, which I guess is faster and easier than Perl anyway. I have one file like this, with two positions in the genome (first and second columns) and the distance between them (third column):

cg00000029  cg01016459  848
cg00000029  cg02021817  38
cg00000029  cg02851944  13
cg00000029  cg02976952  238
cg00000029  cg03943270  93
cg00000029  cg07396495  604
cg00000029  cg12190057  929

My second file looks like this, with the position in the genome followed by one expression value per column for each sample (1 to 6):

TargetID    sample1 sample2 sample3 sample4 sample5 sample6
cg00000029  0.157   0.444   0.466   0.805   0.5489  0.448
cg01016459  0.873   0.930   0.926   0.942   0.932   0.9128  
cg03943270  0.871   0.920   0.926   0.942   0.942   0.942

In fact I have 100 samples. My idea is to get a final file for each sample with the expression values in place of the cg IDs, plus the distance. For example, for sample 1:

0.157  0.873 848
0.157  0.871  93

For sample 2:

0.444   0.930 848
0.444   0.920   93

In Perl I have no problems when I have only two samples: I load the files into two structures (hashes of arrays) and then compare them using nested foreach loops. But it takes a long time for just 2 samples, so imagine 100! I tried in R, loading the data into 2 data frames and using something like

expression[rownames(expression) %in% rownames(distances),]

The problem is that I need something like a loop or an apply function to iterate over the expression data using the first CpG ID and then the second; if both members of a pair are present in expression, output their expression values and the distance.

Any ideas would be welcome

Thanks in advance


Upvotes: 0

Views: 991

Answers (2)

Vincent Zoonekynd

Reputation: 32351

You can join the two data.frames with merge, convert the result to the tall format with melt (from reshape2), and then apply a function (e.g., to save to a file) to each piece of the result with d_ply (from plyr).

# Sample data
n <- length(LETTERS)
d1 <- cbind( expand.grid( LETTERS, LETTERS ), rnorm( n*n ) )
names(d1) <- c("id1", "id2", "distance")
d1 <- d1[ as.character(d1$id1) < as.character(d1$id2), ]
d2 <- as.data.frame( matrix( rnorm(n*6), nr=n ) )
d2 <- data.frame( id=LETTERS, d2 )
names( d2 )[-1] <- paste( "sample", 1:6, sep="")

# If the distance data.frame only contains half the pairs,
# i.e., if it only contains one of (a,b) and (b,a), 
# add the missing ones.    
d1a <- d1
d1b <- d1[,c(2,1,3)]
names(d1b) <- names(d1a)
d1 <- rbind( d1a, d1b )
d1 <- d1[ ! duplicated( d1[,1:2]), ]

# Merge the two data.frames    
d <- merge( d1, d2, by.x="id1", by.y="id" )

# Convert to tall format
library(reshape2)  # provides melt
library(plyr)      # provides d_ply
d <- melt(d, id.vars=c("id1", "id2", "distance"))

# Apply a function to each chunk
d_ply( d, "variable", function (u) { 
  cat( "Would save ", nrow(u), " rows to ", as.character(u$variable[1]), "\n" ) 
} )
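
Note that the single merge on id1 above attaches the sample values for the first position only. A minimal base R sketch of one way to get the values for both positions, assuming the IDs are stored as plain columns, is to merge twice. The names distances, expression, both and per_sample are illustrative, and only two samples are shown.

```r
# Small stand-ins for the question's two files (column names are assumptions)
distances <- data.frame(
  id1 = rep("cg00000029", 7),
  id2 = c("cg01016459", "cg02021817", "cg02851944", "cg02976952",
          "cg03943270", "cg07396495", "cg12190057"),
  distance = c(848L, 38L, 13L, 238L, 93L, 604L, 929L),
  stringsAsFactors = FALSE
)
expression <- data.frame(
  TargetID = c("cg00000029", "cg01016459", "cg03943270"),
  sample1 = c(0.157, 0.873, 0.871),
  sample2 = c(0.444, 0.930, 0.920),
  stringsAsFactors = FALSE
)

# First merge: values for the first position.
# Second merge: values for the second position; the suffixes keep the
# two sets of sample columns apart. merge() is an inner join by default,
# so pairs with a missing position are dropped.
both <- merge(distances, expression, by.x = "id1", by.y = "TargetID")
both <- merge(both, expression, by.x = "id2", by.y = "TargetID",
              suffixes = c(".first", ".second"))

# One three-column data.frame per sample: first value, second value, distance
samples <- setdiff(names(expression), "TargetID")
per_sample <- lapply(setNames(samples, samples), function(s) {
  both[, c(paste0(s, ".first"), paste0(s, ".second"), "distance")]
})

print(per_sample$sample1)
```

Each element of per_sample can then be passed to write.table to produce one file per sample.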

Upvotes: 0

Justin

Reputation: 43255

If your first data set is in dat:

structure(list(V1 = c("cg00000029", "cg00000029", "cg00000029", 
"cg00000029", "cg00000029", "cg00000029", "cg00000029"), V2 = c("cg01016459", 
"cg02021817", "cg02851944", "cg02976952", "cg03943270", "cg07396495", 
"cg12190057"), V3 = c(848L, 38L, 13L, 238L, 93L, 604L, 929L)), .Names = c("V1", 
"V2", "V3"), class = "data.frame", row.names = c(NA, -7L))

and the second set is in target:

structure(list(TargetID = c("cg00000029", "cg01016459", "cg03943270"
), sample1 = c(0.157, 0.873, 0.871), sample2 = c(0.444, 0.93, 
0.92), sample3 = c(0.466, 0.926, 0.926), sample4 = c(0.805, 0.942, 
0.942), sample5 = c(0.5489, 0.932, 0.942), sample6 = c(0.448, 
0.9128, 0.942)), .Names = c("TargetID", "sample1", "sample2", 
"sample3", "sample4", "sample5", "sample6"), class = "data.frame", row.names = c(NA, 
-3L))

match() will get you what you're looking for. I would use the reshape and plyr packages, specifically melt and dlply, but I'm sure there is an apply version too.

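library(reshape2)  # provides melt; reshape2 is the successor of reshape
library(plyr)      # provides dlply

# Tall format: one row per (TargetID, sample) with columns variable and value
target.melt <- melt(target, id.vars = 'TargetID')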

my.func <- function(lookup,df) {
  cg.one <- lookup$value[match(df$V1,lookup$TargetID)]
  cg.two <- lookup$value[match(df$V2,lookup$TargetID)]

  return(list(cgone=cg.one,cgtwo=cg.two,distance=df$V3))
}

out <- dlply(target.melt,.(variable),my.func,df=dat)

There are a bunch of NAs with your data, since the second data set is incomplete, but what you asked for is there:

> na.omit(as.data.frame(out[[1]]))
  cgone cgtwo distance
1 0.157 0.873      848
5 0.157 0.871       93
> 
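
As for the apply version mentioned above, here is a rough base R sketch using match and lapply with no extra packages. It assumes dat and target are laid out as above (only two samples are repeated here to keep it short); the names i1, i2, keep and out, and the choice of tempdir() for the output files, are illustrative.

```r
dat <- data.frame(
  V1 = rep("cg00000029", 7),
  V2 = c("cg01016459", "cg02021817", "cg02851944", "cg02976952",
         "cg03943270", "cg07396495", "cg12190057"),
  V3 = c(848L, 38L, 13L, 238L, 93L, 604L, 929L),
  stringsAsFactors = FALSE
)
target <- data.frame(
  TargetID = c("cg00000029", "cg01016459", "cg03943270"),
  sample1 = c(0.157, 0.873, 0.871),
  sample2 = c(0.444, 0.930, 0.920),
  stringsAsFactors = FALSE
)

# Row of target holding each first and second position (NA when absent).
# The lookup is done once and reused for every sample.
i1 <- match(dat$V1, target$TargetID)
i2 <- match(dat$V2, target$TargetID)
keep <- !is.na(i1) & !is.na(i2)

sample_cols <- setdiff(names(target), "TargetID")
out <- lapply(setNames(sample_cols, sample_cols), function(s) {
  data.frame(cgone = target[[s]][i1[keep]],
             cgtwo = target[[s]][i2[keep]],
             distance = dat$V3[keep])
})

# One tab-separated file per sample, written to a temporary directory here
for (s in names(out)) {
  write.table(out[[s]], file = file.path(tempdir(), paste0(s, ".txt")),
              sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)
}
```

Because the index vectors are computed once, adding more samples only adds cheap column subsetting, which should matter with 100 samples.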

Upvotes: 2
