Creating a count matrix from factor level occurrences in a list of data frames

Question

Since I cannot share my real data, here are two small text files containing the first 5 lines of two of my input files:

https://www.dropbox.com/sh/s0rmi2zotb3dx3o/AAAq0G3LbOokfN8MrYf7jLofa?dl=0

I read all text files in the working directory into a list, keep the first three columns, set new column names, and subset by a numerical cutoff on the third column:

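# Read every .txt file in the working directory as a tab-separated table
all.files <- list.files(pattern = "\\.txt$")
data.list <- lapply(all.files, function(x) read.table(x, sep = "\t"))
names(data.list) <- all.files

# Keep the first three columns and give them descriptive names
data.list <- lapply(data.list, function(x) x[, 1:3])
new.names <- c("query", "sbjct", "ident")
data.list <- lapply(data.list, setNames, new.names)

# Keep only rows with an identity above 99
new.list <- lapply(data.list, function(x) subset(x, ident > 99))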

I end up with a list of data frames, each consisting of three columns.

Now, I want to:

1. count the occurrences of each factor level in the column "sbjct" in every data frame in the list, and
2. build a matrix from the counts, in which rows = factor levels of "sbjct" and columns = occurrences in each data frame.

For each data frame in the list, a new object with two columns (sbjct/counts) should be created and named after the corresponding data frame in the original list. In the end, all the new objects should be merged (with cbind, for example), and empty cells (levels absent from a data frame) should be filled with zeros, resulting in a "sbjct x counts" matrix.
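
To make the target concrete, here is a minimal base R sketch of the result I am after. The two toy data frames and their values (q1, s1, fileA.txt, and so on) are made up for illustration and merely stand in for new.list; this is only one possible way to get there, not necessarily the best one.

```r
# Toy stand-in for new.list: two small data frames with made-up values.
new.list <- list(
  fileA.txt = data.frame(query = c("q1", "q2", "q3"),
                         sbjct = c("s1", "s1", "s2"),
                         ident = c(100, 99.5, 99.8),
                         stringsAsFactors = FALSE),
  fileB.txt = data.frame(query = c("q4", "q5"),
                         sbjct = c("s2", "s3"),
                         ident = c(99.9, 100),
                         stringsAsFactors = FALSE)
)

# Collect every sbjct value that occurs anywhere in the list.
all.sbjct <- sort(unique(unlist(
  lapply(new.list, function(x) as.character(x$sbjct))
)))

# Count each data frame against the shared set of levels, so that
# levels absent from a data frame are reported as 0.
count.matrix <- sapply(new.list, function(x) {
  as.vector(table(factor(as.character(x$sbjct), levels = all.sbjct)))
})
rownames(count.matrix) <- all.sbjct

count.matrix
```

Here the rows of count.matrix are the sbjct levels and the columns are named after the list elements, which is the layout described above. Note that sapply only simplifies to a matrix when there is more than one distinct sbjct level.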

For example, if I had a single data frame, dplyr would help me like this:

library(dplyr)
some.object <- some.dataframe %>%
  group_by(sbjct) %>%
  summarise(counts = n())

>some.object
Source: local data frame [5 x 2]

            sbjct counts
1 AB619702.1.1454       1
2 EU287121.1.1497       1
3 HM062118.1.1478       1
4 KC437137.1.1283       1
5        Yq2He155       1

However, this does not seem to work directly on a list of data frames.
