stephanhart

Reputation: 31

merge multiple files with different rows in R

I know that this question has been asked previously, but answers to the previous posts cannot seem to solve my problem.

I have dozens of tab-delimited .txt files. Each file has two columns ("pos", "score"). I would like to compile all of the "score" columns into one file with multiple columns. The number of rows in each file varies and they are irrelevant for the compilation.

If someone could direct me on how to accomplish this, preferably in R, it would be very helpful.

Alternatively, my ultimate goal is to obtain the median and mean of the "score" column from each file. So if that could be accomplished, with or without compiling the files, it would be even more helpful.

Thanks.

UPDATE:

As appealing as the idea of personal code ninjas is, I understand this will have to remain a fantasy. Sorry for not being explicit.

I have tried lapply and Reduce, e.g.,

> files <- dir(pattern="X.*\\.txt$")
> File_list <- lapply(files, function(score)
+   read.table(score, header=TRUE, row.names=1))
> File_list <- lapply(File_list, function(z) z[c("pos","score")])
> out_file <- Reduce(function(x,y) {merge(x,y,by=c("pos"))},File_list)

which I know doesn't really make sense, considering I have variable row numbers. I have also tried plyr

> library(plyr)
> files <- list.files()
> out_list <- llply(files, read.table)

I have also tried cbind and rbind. Usually I get an error message because the row numbers don't match up, or I just get all the "score" data compiled into one column.

The advice on similar posts (e.g., Merging multiple csv files in R, Simultaneously merge multiple data.frames in a list, and Merge multiple files in a list with different number of rows) has not been helpful.

I hope this clears things up.

Upvotes: 2

Views: 3053

Answers (2)

Victor K.

Reputation: 4094

This problem could be solved in two steps:

Step 1. Read the data from your files into a list of data frames, where files is a vector of file names. If you need to pass extra arguments to read.csv (for example, sep = "\t" for your tab-delimited files), add them as shown below. See ?lapply for details.

list_of_dataframes <- lapply(files, read.csv, stringsAsFactors = FALSE)
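For instance, building the files vector and passing the tab separator might look like this (a sketch, assuming the .txt files sit in the working directory):

files <- list.files(pattern = "\\.txt$")   # assumes the .txt files are in the working directory
list_of_dataframes <- lapply(files, read.csv, sep = "\t", stringsAsFactors = FALSE)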

Step 2. Calculate means for each data frame:

means <- sapply(list_of_dataframes, function(df) mean(df$score))

Of course, you can always do it in one step like this:

means <- sapply(files, function(filename) mean(read.csv(filename)$score))
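Since the question also asks for the median, a small extension of the same idea (a sketch, again assuming tab-delimited files with a "score" column) could be:

stats <- sapply(files, function(filename) {
  score <- read.csv(filename, sep = "\t", stringsAsFactors = FALSE)$score
  c(mean = mean(score), median = median(score))
})

Each column of stats then corresponds to one file, with the mean in the first row and the median in the second.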

Upvotes: 1

eddi

Reputation: 49448

I think you want something like this:

all_data = do.call(rbind, lapply(files,
                                 function(f) {
                                   cbind(read.csv(f), file_name=f)
                                 }))

You can then do whatever "by" type of action you like. Also, don't forget to adjust the various read.csv options to suit your needs.

E.g. once you have the above, you can do the following (and much more):

library(data.table)
dt = data.table(all_data)

dt[, list(mean(score), median(score)), by = file_name]

A small note: you could also use data.table's fread to read in the files instead of read.table and its derivatives, which would be much faster; and while we're at it, use rbindlist instead of do.call(rbind, ...).
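A sketch of that variant (assuming, as above, tab-delimited files with a header and a "score" column):

library(data.table)

all_data = rbindlist(lapply(files, function(f) {
  dt = fread(f)           # fread detects the tab separator automatically
  dt[, file_name := f]    # tag each row with its source file
  dt
}))

all_data[, list(mean(score), median(score)), by = file_name]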

Upvotes: 0
