Reputation: 3461
Since I cannot share my real data, here are two small text files containing the first 5 lines of two of my input files:
https://www.dropbox.com/sh/s0rmi2zotb3dx3o/AAAq0G3LbOokfN8MrYf7jLofa?dl=0
I read all text files in the working directory into a list, keep the first three columns, set new column names and subset by a numerical cutoff in the third column:
# "\\.txt$" matches only names ending in .txt (".*.txt" treats the dot as a wildcard)
all.files <- list.files(pattern = "\\.txt$")
data.list <- lapply(all.files, function(x) read.table(x, sep = "\t"))
names(data.list) <- all.files
# keep the first three columns and rename them
data.list <- lapply(data.list, function(x) x[, 1:3])
new.names <- c("query", "sbjct", "ident")
data.list <- lapply(data.list, setNames, new.names)
# keep only rows with ident above 99
new.list <- lapply(data.list, function(x) subset(x, ident > 99))
I end up with a list of data frames, each consisting of three columns.
Now, I want to count how often each sbjct occurs in each data frame.
For each data frame in the list, a new object with two columns (sbjct/counts) should be created and named after the corresponding data frame in the original list. In the end, all the new objects should be merged (with cbind, for example), and empty cells (data absent) should be filled with zeros, resulting in a "sbjct x counts" matrix.
For example, if I had a single data frame, dplyr would help me like this:
library(dplyr)
some.object <- some.dataframe %>%
  group_by(sbjct) %>%
  summarise(counts = n())
>some.object
Source: local data frame [5 x 2]
sbjct counts
1 AB619702.1.1454 1
2 EU287121.1.1497 1
3 HM062118.1.1478 1
4 KC437137.1.1283 1
5 Yq2He155 1
But it seems this cannot be applied directly to a list of data frames.
Upvotes: 0
Views: 1182
Reputation: 78
Add a column to each data set that acts as an indicator (let's call it Ndata) of which data set each observation comes from. Then rbind all the data sets.
Now, when you cross-tabulate sbjct against Ndata, you get the matrix you are looking for.
Here is some code to clarify:
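# example subject labels (not named "t", which would mask base::t)
subjects <- c("a", "b", "c", "d", "e", "f")
set.seed(10)
# three toy data sets with a random number of rows (1 to 20 each)
d1 <- data.frame(sbjct = sample(subjects, sample(20, 1), replace = TRUE))
d2 <- data.frame(sbjct = sample(subjects, sample(20, 1), replace = TRUE))
d3 <- data.frame(sbjct = sample(subjects, sample(20, 1), replace = TRUE))
# indicator column recording which data set each row came from
d1$Ndata <- rep("d1", nrow(d1))
d2$Ndata <- rep("d2", nrow(d2))
d3$Ndata <- rep("d3", nrow(d3))
# stack the data sets and cross-tabulate sbjct against Ndata
combined <- rbind(d1, d2, d3)
ct <- table(combined$sbjct, combined$Ndata)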
ct looks like this:
> ct
d1 d2 d3
a 1 0 0
b 4 0 1
c 2 2 1
d 3 1 0
e 1 0 0
>
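The same idea can be applied to a named list of data frames such as new.list from the question, without writing out each data set by hand. The following is only a sketch under assumptions: the toy new.list, its file names and its values are made up for illustration, and only the sbjct column is kept.

```r
# Toy stand-in for new.list: a named list of data frames with a sbjct column.
# File names and values are hypothetical.
new.list <- list(
  "file1.txt" = data.frame(sbjct = c("a", "b", "b", "c"), stringsAsFactors = FALSE),
  "file2.txt" = data.frame(sbjct = c("b", "d"), stringsAsFactors = FALSE)
)

# Tag every row with the name of the list element it came from.
# rep() also works for data frames with zero rows.
tagged <- Map(function(df, nm) {
  df$Ndata <- rep(nm, nrow(df))
  df
}, new.list, names(new.list))

# Stack all data frames into one
combined <- do.call(rbind, tagged)

# Cross-tabulate sbjct against the source name. Making Ndata a factor with
# all list names as levels keeps a column even for a file with no rows left.
counts <- table(combined$sbjct,
                factor(combined$Ndata, levels = names(new.list)))

# Drop the "table" class to get a plain integer matrix
count.matrix <- unclass(counts)
```

table() fills combinations that never occur with 0, so no separate step is needed to replace missing cells.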
Upvotes: 1