Reputation: 169
I'm trying to combine the info from a number of data frames to fill one big data frame. The first dataframe is like an overview of all human genes:
gene_id chromosome gene_start gene_end
ENSG00000116396 1 110753965 110825722
ENSG00000228217 1 118320709 118321128
ENSG00000261716 1 149816065 149820591
ENSG00000223562 1 211355498 211356446
ENSG00000239859 1 36171626 36171875
ENSG00000197921 1 2460184 2461684
ENSG00000232237 1 201083081 201096312
ENSG00000212257 1 65488651 65488757
ENSG00000158887 1 161274525 161279762
ENSG00000238122 1 108803818 108816311
ENSG00000215846 1 159246293 159247282
ENSG00000266763 1 26238240 26238313
ENSG00000228634 1 32398621 32399576
ENSG00000177614 1 230457392 230561475
ENSG00000163462 1 155145873 155157447
ENSG00000204481 1 13668269 13673511
Secondly, I have a number of files (516 of them), containing data like this:
Sample Chromosome Start End Num_Probes Segment_Mean
UNDID_p_TCGA_353_354_355_37_NSP_GenomeWideSNP_6_H10_1376930 1 61735 82170 9 0.2560
UNDID_p_TCGA_353_354_355_37_NSP_GenomeWideSNP_6_H10_1376930 1 82315 16869363 8678 -0.1199
UNDID_p_TCGA_353_354_355_37_NSP_GenomeWideSNP_6_H10_1376930 1 16871278 17087292 85 -0.5386
UNDID_p_TCGA_353_354_355_37_NSP_GenomeWideSNP_6_H10_1376930 1 17089349 17209603 23 -0.0807
UNDID_p_TCGA_353_354_355_37_NSP_GenomeWideSNP_6_H10_1376930 1 17210652 17262232 57 0.2680
UNDID_p_TCGA_353_354_355_37_NSP_GenomeWideSNP_6_H10_1376930 1 17262247 25583341 5240 -0.1228
UNDID_p_TCGA_353_354_355_37_NSP_GenomeWideSNP_6_H10_1376930 1 25593128 25646986 28 -1.8216
UNDID_p_TCGA_353_354_355_37_NSP_GenomeWideSNP_6_H10_1376930 1 25661501 30738534 2398 -0.0942
UNDID_p_TCGA_353_354_355_37_NSP_GenomeWideSNP_6_H10_1376930 1 30739299 30745210 7 -1.3117
Now, I would like to create some sort of loop, to take each gene in the first data frame. Then I would like to loop through all 516 data frames while checking the place of the gene.
So for each gene and each file, I would like to compare the start and end of the gene with the start and end of each sample segment, provided that they are on the same chromosome. If this is the case, I want to take the segment mean and put it in one big new dataframe with the gene_id's as rownames and the names of the files as column names.
This is the code I already have:
for(gene in 1:100){
gene_id <- genome[gene, 1]
chromosome <- genome[gene, 2]
gene_start <- genome[gene, 3]
gene_end <- genome[gene, 4]
for(name in 1:length(dataframe_names)){
df <- get(dataframe_names[name])
for(segment in 1:nrow(df)){
if(chromosome == as.character(df[segment,2])){
if(gene_start > df[segment,3] && gene_start < df[segment,4] && gene_end > df[segment,3] && gene_end < df[segment,4]){
data_matrix[gene,name] <- df[segment, 6]
}
}
}
}
}
This code works, but it's really slow, considering that there are 57 773 genes. A test run with 100 genes took 1 hour, so a complete run would probably take 2-3 weeks...
I think using the apply
-family will speed things up, but I've never used them before so I don't really know how to do it. Examples on the internet always use things like sum
or mean
, but I don't want to these kind of things, I just want to compare the numbers. Also, I don't know which one of the apply
-family suits my needs best.
You guys wanna help me out or get on the right track?
Upvotes: 0
Views: 144
Reputation: 3525
First thing you need to do is stop reinventing the wheel. Get familiar with the bioconductor packages IRanges
, GenomicRanges
(and Biostrings
for completeness, although not for this particular problem). Once you get GenomicRanges look into the findOverlaps family of functions.
Since your exmple data doesn't actually have any overlaps I modified it like so
df1 <- structure(list(gene_id = structure(c(2L, 5L, 1L, 3L, 7L, 6L,
4L), .Label = c("ENSG00000207157", "ENSG00000223116", "ENSG00000229483",
"ENSG00000232849", "ENSG00000233440", "ENSG00000235205", "ENSG00000252952"
), class = "factor"), chromosome = c(13L, 13L, 13L, 13L, 13L,
13L, 13L), gene_start = c(23551994L, 23708313L, 23726725L, 23743974L,
23791571L, 23817659L, 93708910L), gene_end = c(23552136L, 23708703L,
23726825L, 23744736L, 23791673L, 23821323L, 93710179L)), .Names = c("gene_id",
"chromosome", "gene_start", "gene_end"), class = "data.frame", row.names = c(NA,
-7L))
df2 <- structure(list(Sample = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), .Label = "UNDID_p_TCGA_353_354_355_37_NSP_GenomeWideSNP_6_H10_1376930", class = "factor"),
Chromosome = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), Start = c(61735L,
82315L, 16871278L, 17089349L, 17210652L, 17262247L, 25593128L,
25661501L, 30739299L), End = c(82170L, 16869363L, 17087292L,
17209603L, 17262232L, 25583341L, 25646986L, 30738534L, 30745210L
), Num_Probes = c(9L, 8678L, 85L, 23L, 57L, 5240L, 28L, 2398L,
7L), Segment_Mean = c(0.256, -0.1199, -0.5386, -0.0807, 0.268,
-0.1228, -1.8216, -0.0942, -1.3117)), .Names = c("Sample",
"Chromosome", "Start", "End", "Num_Probes", "Segment_Mean"), class = "data.frame", row.names = c(NA,
-9L))
df1$chromosome <- 1
df1[6,4] <- 30745215 # to show what happens when there are multiple overlaps
Now to get the overlaps
library(GenomicRanges)
Genes <- GRanges(df1$chromosome,
IRanges(df1$gene_start, df1$gene_end), genes=df1$gene_id)
# make a list of the 512 others, and read 1 one for example
files <- list.files(pattern="csv") # assuming they are .csv files
snps0 <- read.csv(files[[1]])
snps <- GRanges(snps0$Chromosome, IRanges(snps0$Start, snps0$End),
Segment_Mean=snps0$Segment_Mean)
olaps <- findOverlaps(query=snps, subject=Genes)
Once you have the overlaps you can use that as a basis to merge the original data frames
olaps2 <- as.data.frame(olaps)
df1$Row <- rownames(df1)
NewDF <- merge(df1,olaps2,by.x="Row",by.y="subjectHits",all=T,sort=F)
df2$Row <- rownames(df2)
NewDF2 <- merge(NewDF,df2,by.x="queryHits",by.y="Row",all.x=T,sort=F)[,c(-1,-2)]
# drop the first 2 columns because they were just temporary for merging purposes
head(NewDF2,2)
gene_id chromosome gene_start gene_end
1 ENSG00000223116 1 23551994 23552136
2 ENSG00000207157 1 23726725 23726825
Sample Chromosome Start
1 UNDID_p_TCGA_353_354_355_37_NSP_GenomeWideSNP_6_H10_1376930 1 17262247
2 UNDID_p_TCGA_353_354_355_37_NSP_GenomeWideSNP_6_H10_1376930 1 17262247
End Num_Probes Segment_Mean
1 25583341 5240 -0.1228
2 25583341 5240 -0.1228
Upvotes: 3