Reputation: 621
I am using an ArrayExpress dataset to build a data frame that I can run in GenePattern.
My folder, GSE11000, contains a number of files whose names follow this pattern:
GSM123445_samples_table.txt
GSM129995_samples_table.txt
Inside each file, the table follows this pattern:
Identifier VALUE
10001 0.12323
10002 0.11535
I have a data frame, clinical_data, that lists all the files I want, in this format:
Data.File Samples OS.event
1 GSM123445_samples_table.txt GSM123445 0
2 GSM129995_samples_table.txt GSM129995 0
3 GSM129999_samples_table.txt GSM129999 1
4 GSM130095_samples_table.txt GSM130095 1
I want to create a data frame that looks like this:
Identifier GSM123445 GSM129995 GSM129999 GSM130095
1 10001 0.12323 0.14523 0.22387 0.56233
2 10002 0.11535 0.39048 0.23437 -0.12323
3 10006 0.12323 0.35634 0.12237 -0.12889
4 10008 0.11535 0.23454 0.21227 0.90098
This is my code:
library(dplyr)
setwd(.../GSE11000)
file_list <- clinical_data[, 1] # create a list that include Data.File
for (file in file_list){
  if (!exists("dataset")){ # if dataset not exists, create one
    dataset <- read.table(file, header=TRUE, sep="\t") #read txt file from folder
    x <- unlist(strsplit(file, "_"))[1] # extract the GSMxxxxxx from the name of files
    dataset <- rename(dataset, x = VALUE) # rename the column
  }
  else {
    temp_dataset <- read.table(file, header=TRUE, sep="\t") # read file
    x <- unlist(strsplit(file, "_"))[1]
    temp_dataset <- rename(temp_dataset, x = VALUE)
    dataset<-left_join(dataset, temp_dataset, "Reporter.Identifier")
    rm(temp_dataset)
  }
}
My result is this:
Identifier x.x x.y x.x x.y
1 10001 0.12323 0.14523 0.22387 0.56233
2 10002 0.11535 0.39048 0.23437 -0.12323
3 10006 0.12323 0.35634 0.12237 -0.12889
4 10008 0.11535 0.23454 0.21227 0.90098
This is because the rename step failed to work.
Does anyone have an idea how I can solve this problem? And can anyone make my code more efficient?
If you can tell me how to use Bioconductor to work with this data, I would be grateful too.
Upvotes: 0
Views: 70
Reputation: 160447
Similar to @jdobres, but using dplyr (and spread):
First, to create some sample data files:
set.seed(42)
for (fname in sprintf("GSM%s_samples_table.txt", sample(10000, size = 4))) {
  write.table(data.frame(Identifier = 10001:10004, VALUE = runif(4)),
              file = fname, row.names = FALSE)
}
file_list <- list.files(pattern = "GSM.*")
file_list
# [1] "GSM2861_samples_table.txt" "GSM8302_samples_table.txt"
# [3] "GSM9149_samples_table.txt" "GSM9370_samples_table.txt"
read.table(file_list[1], skip = 1, col.names = c("Identifier", "VALUE"))
# Identifier VALUE
# 1 10001 0.9346722
# 2 10002 0.2554288
# 3 10003 0.4622928
# 4 10004 0.9400145
Now the processing:
library(dplyr)
library(tidyr)
mapply(function(fname, varname)
         cbind.data.frame(Samples = varname,
                          read.table(fname, skip = 1, col.names = c("Identifier", "VALUE")),
                          stringsAsFactors = FALSE),
       file_list, gsub("_.*", "", file_list), SIMPLIFY = FALSE) %>%
  bind_rows() %>%
  spread(Samples, VALUE)
# Identifier GSM2861 GSM8302 GSM9149 GSM9370
# 1 10001 0.9346722 0.9782264 0.6417455 0.6569923
# 2 10002 0.2554288 0.1174874 0.5190959 0.7050648
# 3 10003 0.4622928 0.4749971 0.7365883 0.4577418
# 4 10004 0.9400145 0.5603327 0.1346666 0.7191123
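As a side note on the x.x / x.y columns in the question: rename(dataset, x = VALUE) most likely creates a column literally named x instead of using the string stored in x, and left_join then adds suffixes to the clashing names. Below is a minimal base-R sketch that avoids this by assigning the name directly and merging with Reduce. The helper name read_sample and the two demo files written to a temporary directory are made up for illustration and are not part of the original data.

```r
# Hypothetical demo files in a temp directory (stand-ins for the real GSM files)
tmp <- tempfile("gsm_demo_")
dir.create(tmp)
demo_files <- file.path(tmp, c("GSM1_samples_table.txt", "GSM2_samples_table.txt"))
write.table(data.frame(Identifier = 10001:10003, VALUE = c(0.1, 0.2, 0.3)),
            file = demo_files[1], sep = "\t", row.names = FALSE, quote = FALSE)
write.table(data.frame(Identifier = 10001:10003, VALUE = c(0.4, 0.5, 0.6)),
            file = demo_files[2], sep = "\t", row.names = FALSE, quote = FALSE)

# Hypothetical helper: read one file and name its value column after the sample ID
read_sample <- function(path) {
  d <- read.table(path, header = TRUE, sep = "\t")
  sample_id <- sub("_.*", "", basename(path))   # "GSM1_samples_table.txt" -> "GSM1"
  names(d)[names(d) == "VALUE"] <- sample_id    # uses the string value, not the symbol x
  d
}

# Merge all per-sample tables on the shared Identifier column;
# all.x = TRUE keeps every row of the left table, similar to left_join
wide <- Reduce(function(a, b) merge(a, b, by = "Identifier", all.x = TRUE),
               lapply(demo_files, read_sample))
wide
```

If you would rather stay within dplyr, rename(dataset, !!x := VALUE) is the usual way to use the value of x as the new column name.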
Upvotes: 2
Reputation: 11957
Hard to tell if this will work, since your example isn't quite reproducible, but here's how I'd tackle it.
First, read all of the data files into one large data frame, creating an extra column called "sample" which will hold your sample label.
library(plyr)
df <- ddply(clinical_data, .(Data.File), function(x) {
  data.this <- read.table(x$Data.File, header=TRUE, sep="\t")
  data.this$sample <- x$Samples
  return(data.this)
})
Then use the tidyr::spread function to create a new column for each "sample" with the values in the "VALUE" column.
library(tidyr)
df <- spread(df, sample, VALUE)
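To make the reshaping step concrete, here is a small sketch on an in-memory long data frame, so it runs without the data files. It uses tidyr::pivot_wider, which tidyr 1.0.0 and later recommend in place of spread. The values and the names long_df and wide_df are invented for illustration.

```r
library(tidyr)

# Invented long-format data: one row per (Identifier, sample) pair
long_df <- data.frame(
  Identifier = rep(c(10001, 10002), times = 2),
  sample     = rep(c("GSM1", "GSM2"), each = 2),
  VALUE      = c(0.1, 0.2, 0.3, 0.4),
  stringsAsFactors = FALSE
)

# One column per sample, filled from VALUE
# (same result shape as spread(long_df, sample, VALUE))
wide_df <- pivot_wider(long_df, names_from = sample, values_from = VALUE)
wide_df
```

Each distinct value of sample becomes its own column, and rows are matched on the remaining column, Identifier.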
Upvotes: 0