Reputation: 67
I have sixty text files, each with two columns as shown below, each representing a unique sample, and headed 'Coverage' and 'counts'. The length of each file differs by a few rows, because for some values of Coverage, the Count is zero, therefore not printed. Each file is about 1000 rows long. Each file is named in the format "B001.BaseCovDist.txt" to "B060.BaseCovDist.txt", and in R I have them as "B001" to "B060".
How can I combine the data frames by Coverage? This is complicated by missing rows. I've tried various approaches in bash, base R, reshape(2), and dplyr.
How can I make a single graph of the Counts(y-axis) against Coverage (x-axis) with each unique sample as a different series. Ggplot2 seems ideal but I seem to need a loop or a list to add the series without having to type out all of the names in full (which would be ridiculous).
One approach that seemed good was to add a third column that contains the unique sample name because this creates a molten dataset. However this didn't work in bash (awk) because the number of whitespace delimiters varies by row.
Any help would be very welcome.
Coverage Count
1 0 7089359
2 1 983611
3 2 658253
4 3 520767
5 4 448916
6 5 400904
Upvotes: 2
Views: 442
Reputation: 67
Just to add to your answer, r2evans I added a gsub command so that the filename suffix is removed from the added column (and also some boring import modifers).
allfiles <- lapply(list.files(pattern='.BasCovDis.txt'), function(sample) cbind(sample=gsub("[.]BasCovDis.txt","", sample), read.table(sample, header=T, skip=3)))
Upvotes: 0
Reputation: 160687
A good starting point is to consider a long-format for the data vice a wide-format. Since you mentioned reshape2
, this should make sense, but check out tidyr
as well, as the docs for both document the differences between long/wide.
Going with a long format, try the following:
allfiles <- lapply(list.files(pattern='foo.csv'),
function(fname) cbind(fname=fname, read.csv(fname)))
dat <- rbind_all(allfiles)
dat
## fname Coverage Count
## 1 B001.BaseCovDist.txt 0 7089359
## 2 B001.BaseCovDist.txt 1 983611
## 3 B001.BaseCovDist.txt 2 658253
## 4 B001.BaseCovDist.txt 3 520767
## 5 B001.BaseCovDist.txt 4 448916
## 6 B001.BaseCovDist.txt 5 400904
ggplot(data=dat, aes(x=Coverage, y=Count, group=fname)) + geom_line()
Upvotes: 1