pill45

Reputation: 621

How to rename a data frame column to a string stored in a variable using dplyr::rename()?

I am using an ArrayExpress dataset to build a data frame that I can then run in GenePattern.

My folder, GSE11000, contains a number of files whose names follow this pattern:

GSM123445_samples_table.txt
GSM129995_samples_table.txt

Each file contains a table in this format:

Identifier     VALUE
     10001   0.12323
     10002   0.11535

I have a data frame, clinical_data, that lists all the files I want. It looks like this:

                     Data.File      Samples     OS.event
1  GSM123445_samples_table.txt    GSM123445            0
2  GSM129995_samples_table.txt    GSM129995            0
3  GSM129999_samples_table.txt    GSM129999            1
4  GSM130095_samples_table.txt    GSM130095            1

I want to create a data frame that looks like this:

     Identifier  GSM123445  GSM129995  GSM129999  GSM130095
 1       10001     0.12323    0.14523    0.22387    0.56233
 2       10002     0.11535    0.39048    0.23437   -0.12323
 3       10006     0.12323    0.35634    0.12237   -0.12889
 4       10008     0.11535    0.23454    0.21227    0.90098

This is my code

library(dplyr)
setwd(.../GSE11000)
file_list <- clinical_data[, 1] # create a list that include Data.File
for (file in file_list){
  if (!exists("dataset")){     # if dataset not exists, create one
     dataset <- read.table(file, header=TRUE, sep="\t") #read txt file from folder
     x <- unlist(strsplit(file, "_"))[1] # extract the GSMxxxxxx from the name of files
     dataset <- rename(dataset, x = VALUE) # rename the column
  }     
  else {
     temp_dataset <- read.table(file, header=TRUE, sep="\t") # read file
     x <- unlist(strsplit(file, "_"))[1]
     temp_dataset <- rename(temp_dataset, x = VALUE)    
     dataset<-left_join(dataset, temp_dataset, "Reporter.Identifier")
     rm(temp_dataset)
  }
}

My outcome is this

     Identifier        x.x        x.y        x.x        x.y
 1       10001     0.12323    0.14523    0.22387    0.56233
 2       10002     0.11535    0.39048    0.23437   -0.12323
 3       10006     0.12323    0.35634    0.12237   -0.12889
 4       10008     0.11535    0.23454    0.21227    0.90098

This is because the rename step did not work: the columns are all named x instead of the GSM sample IDs.

Does anyone have an idea how I can solve this problem? And can anyone make my code more efficient?

If you can tell me how to use Bioconductor to work with this data, I would be grateful too.

Upvotes: 0

Views: 70

Answers (2)

r2evans

Reputation: 160447

Similar to @jdobres but using dplyr (and spread):

First, to create some sample data files:

set.seed(42)
for (fname in sprintf("GSM%s_samples_table.txt", sample(10000, size = 4))) {
  write.table(data.frame(Identifier = 10001:10004, VALUE = runif(4)),
              file = fname, row.names = FALSE)
}
file_list <- list.files(pattern = "GSM.*")
file_list
# [1] "GSM2861_samples_table.txt" "GSM8302_samples_table.txt"
# [3] "GSM9149_samples_table.txt" "GSM9370_samples_table.txt"
read.table(file_list[1], skip = 1, col.names = c("Identifier", "VALUE"))
#   Identifier     VALUE
# 1      10001 0.9346722
# 2      10002 0.2554288
# 3      10003 0.4622928
# 4      10004 0.9400145

Now the processing:

library(dplyr)
library(tidyr)
mapply(function(fname, varname)
           cbind.data.frame(Samples = varname,
                            read.table(fname, skip = 1, col.names = c("Identifier", "VALUE")),
                            stringsAsFactors = FALSE),
       file_list, gsub("_.*", "", file_list), SIMPLIFY = FALSE) %>%
  bind_rows() %>%
  spread(Samples, VALUE)
#   Identifier   GSM2861   GSM8302   GSM9149   GSM9370
# 1      10001 0.9346722 0.9782264 0.6417455 0.6569923
# 2      10002 0.2554288 0.1174874 0.5190959 0.7050648
# 3      10003 0.4622928 0.4749971 0.7365883 0.4577418
# 4      10004 0.9400145 0.5603327 0.1346666 0.7191123
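In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(). Here is a minimal sketch of the same reshaping under that assumption. It uses throwaway files in a temporary directory; the file names (GSM0001, GSM0002), the tmp_dir folder and the values are invented for illustration and are not part of the real dataset:

```r
library(dplyr)
library(tidyr)

# Hypothetical sample files written to a temporary directory
tmp_dir <- file.path(tempdir(), "gsm_demo")
dir.create(tmp_dir, showWarnings = FALSE)
for (s in c("GSM0001", "GSM0002")) {
  write.table(data.frame(Identifier = 10001:10003, VALUE = runif(3)),
              file = file.path(tmp_dir, paste0(s, "_samples_table.txt")),
              row.names = FALSE, sep = "\t")
}

demo_files <- list.files(tmp_dir, pattern = "^GSM.*_samples_table\\.txt$",
                         full.names = TRUE)

# Read each file, tag its rows with the sample name, and stack the results
long <- lapply(demo_files, function(fname) {
  tab <- read.table(fname, header = TRUE, sep = "\t")
  tab$Samples <- sub("_.*", "", basename(fname))
  tab
}) %>%
  bind_rows()

# One column per sample, one row per Identifier
wide <- pivot_wider(long, names_from = Samples, values_from = VALUE)
```

The long-then-wide approach avoids renaming columns inside a loop: the sample name travels as data in the Samples column and only becomes a column name at the final reshaping step.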

Upvotes: 2

jdobres

Reputation: 11957

Hard to tell if this will work, since your example isn't quite reproducible, but here's how I'd tackle it.

First, read all of the data files into one large data frame, creating an extra column called "sample" which will hold your sample label.

library(plyr)

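df <- ddply(clinical_data, .(Data.File), function(x) {
    # as.character() guards against Data.File having been read in as a factor
    data.this <- read.table(as.character(x$Data.File), header=TRUE, sep="\t")
    data.this$sample <- x$Samples  # label every row with its sample ID
    return(data.this)
})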

Then use the tidyr::spread function to create a new column for each "sample" with the values in the "VALUE" column.

library(tidyr)
df$Data.File <- NULL  # drop the grouping column, otherwise each file keeps its own rows
df <- spread(df, sample, VALUE)
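As for the x.x / x.y column names in the question: rename(dataset, x = VALUE) takes x as the literal new column name rather than looking up the string stored in the variable x. Every file therefore contributes a column called x, and the join adds suffixes to tell them apart. Here is a minimal sketch of two ways to rename from a variable, assuming a dplyr version with tidy evaluation (0.7.0 or later); the toy data frame is made up for illustration:

```r
library(dplyr)

# Toy stand-in for one sample table
toy <- data.frame(Identifier = c(10001, 10002), VALUE = c(0.12323, 0.11535))
x <- unlist(strsplit("GSM123445_samples_table.txt", "_"))[1]  # "GSM123445"

# dplyr: unquote the variable with !! and use := so its value becomes the name
renamed_dplyr <- rename(toy, !!x := VALUE)

# base R: assign into names() directly
renamed_base <- toy
names(renamed_base)[names(renamed_base) == "VALUE"] <- x
```

Either form could replace the rename() calls in the original loop, although reading everything into one long data frame and reshaping it avoids the renaming step altogether.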

Upvotes: 0
