merge and plot multiple text files

Question

I have sixty text files, one per sample, each with two columns headed 'Coverage' and 'Count', as shown below. The files differ in length by a few rows because a row is not printed when the Count for a given Coverage value is zero. Each file is about 1000 rows long. The files are named "B001.BaseCovDist.txt" to "B060.BaseCovDist.txt", and in R I have them as data frames "B001" to "B060".

How can I combine the data frames by Coverage? This is complicated by the missing rows. I've tried various approaches in bash, base R, reshape(2), and dplyr.
How can I make a single graph of Count (y-axis) against Coverage (x-axis), with each sample as a separate series? ggplot2 seems ideal, but I seem to need a loop or a list to add the series without typing out all of the names in full (which would be ridiculous).

One approach that seemed promising was to add a third column containing the sample name, because this creates a molten dataset. However, this didn't work in bash (awk) because the number of whitespace delimiters varies by row.

Any help would be very welcome.

  Coverage   Count
1        0 7089359
2        1  983611
3        2  658253
4        3  520767
5        4  448916
6        5  400904

r2evans · Accepted Answer

A good starting point is to consider a long format for the data rather than a wide format. Since you mentioned reshape2, this should be familiar, but check out tidyr as well; the documentation for both packages explains the differences between long and wide.
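To make the distinction concrete, here is a minimal base R sketch using two toy data frames (the values are made up, and the second one deliberately lacks a row, mimicking the unprinted zero counts). It is only an illustration of the two shapes, not the recommended pipeline:

```r
# Toy data (made up for illustration): B002 has no row for Coverage 2.
B001 <- data.frame(Coverage = c(0, 1, 2), Count = c(10, 5, 3))
B002 <- data.frame(Coverage = c(0, 1),    Count = c(8, 4))

# Wide format: one row per Coverage value, one Count column per sample.
# all = TRUE keeps Coverage values that are missing from either input (as NA).
wide <- merge(B001, B002, by = "Coverage", all = TRUE,
              suffixes = c(".B001", ".B002"))

# Long format: one row per (sample, Coverage) pair.
# Missing rows are simply absent, so nothing needs to be aligned.
long <- rbind(cbind(sample = "B001", B001),
              cbind(sample = "B002", B002))

print(wide)
print(long)
```

The wide form needs a join that tolerates missing rows, while the long form only needs the tables stacked, which is why it tends to be easier with many samples.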

Going with a long format, try the following:

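library(dplyr)
library(ggplot2)

# Read every per-sample file (whitespace-delimited, with a header row)
# and tag each row with the name of the file it came from.
allfiles <- lapply(list.files(pattern='BaseCovDist\\.txt$'),
                   function(fname) cbind(fname=fname, read.table(fname, header=TRUE)))
# bind_rows() replaces the deprecated rbind_all()
dat <- bind_rows(allfiles)
dat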
##                  fname Coverage   Count
## 1 B001.BaseCovDist.txt        0 7089359
## 2 B001.BaseCovDist.txt        1  983611
## 3 B001.BaseCovDist.txt        2  658253
## 4 B001.BaseCovDist.txt        3  520767
## 5 B001.BaseCovDist.txt        4  448916
## 6 B001.BaseCovDist.txt        5  400904

ggplot(data=dat, aes(x=Coverage, y=Count, group=fname)) + geom_line()
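Two optional refinements may help, sketched below under some assumptions: a short sample label derived from the file name, and explicit zero rows for Coverage values that were not printed. The toy `dat` here stands in for the combined data frame (its values are made up), and the `sample`, `grid`, and `full` names are illustrative rather than part of the original answer:

```r
# Toy stand-in for the combined data frame `dat` (values are made up).
dat <- data.frame(
  fname    = rep(c("B001.BaseCovDist.txt", "B002.BaseCovDist.txt"), times = c(3, 2)),
  Coverage = c(0, 1, 2, 0, 2),
  Count    = c(10, 5, 3, 8, 4)
)

# Short sample label: drop the shared file-name suffix.
dat$sample <- sub("\\.BaseCovDist\\.txt$", "", dat$fname)

# Build every (sample, Coverage) combination, then left-join the observed
# counts onto it; combinations that were never printed get Count = 0.
grid <- expand.grid(sample = unique(dat$sample),
                    Coverage = sort(unique(dat$Coverage)),
                    stringsAsFactors = FALSE)
full <- merge(grid, dat[, c("sample", "Coverage", "Count")],
              by = c("sample", "Coverage"), all.x = TRUE)
full$Count[is.na(full$Count)] <- 0

# Plot one coloured line per sample, if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(full, aes(x = Coverage, y = Count, colour = sample)) + geom_line()
  print(p)
}
```

Whether to fill the zeros is a judgment call: without them, geom_line() simply connects the neighbouring points across the gap. With sixty samples the colour legend becomes large, so keeping `group=` instead of `colour=` may read better.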
