nouse

Reputation: 3461

Creating a count matrix from factor level occurrences in a list of data frames

Since I cannot share example data, here are two small text files containing the first 5 lines of two of my input files:

https://www.dropbox.com/sh/s0rmi2zotb3dx3o/AAAq0G3LbOokfN8MrYf7jLofa?dl=0

I read all text files in the working directory into a list, keep the first three columns, set new column names, and subset by a numerical cutoff in the third column:

all.files <- list.files(pattern = "\\.txt$")
data.list <- lapply(all.files, function(x)read.table(x, sep="\t"))
names(data.list) <- all.files
data.list <- lapply(data.list, function(x) x[,1:3])

new.names<-c("query", "sbjct", "ident")

data.list <- lapply(data.list, setNames, new.names)
new.list <- lapply(data.list, function(x) subset(x, ident>99))

I end up with a list of data frames, each with three columns.

Now, I want to

  1. count the occurrences of each factor level in the column "sbjct" in every data frame in the list, and
  2. build a matrix from the counts, in which rows = factor levels of "sbjct" and columns = counts per data frame.

For each dataframe in the list, a new object with two columns (sbjct/counts) should be created named according to the original dataframe in the original list. In the end, all the new objects should be merged with cbind (for example), and empty cells (data absent) should be filled with zeros, resulting in a "sbjct x counts" matrix.

For example, if I had a single data frame, dplyr would help me like this:

library(dplyr)
some.object <- some.dataframe %>% 
                  group_by(sbjct) %>%
                    summarise(counts = length(sbjct))

> some.object
Source: local data frame [5 x 2]

            sbjct counts
1 AB619702.1.1454       1
2 EU287121.1.1497       1
3 HM062118.1.1478       1
4 KC437137.1.1283       1
5        Yq2He155       1

But it seems it cannot be applied to lists of dataframes.

Upvotes: 0

Views: 1182

Answers (1)

Lalit Sachan

Reputation: 78

Add a column to each data set that acts as an indicator (let's call it Ndata) of which data set each observation comes from. Then rbind all the data sets.

Now, when you make a cross table of sbjct x Ndata, you'll get the matrix you are looking for.

Here is some code to clarify:

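lvls <- c("a", "b", "c", "d", "e", "f")
set.seed(10)
# three toy data sets, each with a random number of rows (1 to 20)
d1 <- data.frame(sbjct = sample(lvls, sample(20, 1), replace = TRUE))
d2 <- data.frame(sbjct = sample(lvls, sample(20, 1), replace = TRUE))
d3 <- data.frame(sbjct = sample(lvls, sample(20, 1), replace = TRUE))

# indicator column: which data set each row came from
d1$Ndata <- "d1"
d2$Ndata <- "d2"
d3$Ndata <- "d3"

all.data <- rbind(d1, d2, d3)

# cross-tabulate sbjct against the data set indicator
ct <- table(all.data$sbjct, all.data$Ndata)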

ct looks like this:

> ct

    d1 d2 d3
  a  1  0  0
  b  4  0  1
  c  2  2  1
  d  3  1  0
  e  1  0  0
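
For the list from the question, the same idea can probably be applied without adding the indicator column by hand. The following is only a sketch: it assumes new.list is a named list of data frames that each have an sbjct column, and the toy data below (file names and values) is made up for illustration. The names sbjct, source.file and count.matrix are just placeholders.

```r
# Toy stand-in for new.list from the question (made-up values).
new.list <- list(
  "file1.txt" = data.frame(query = c("q1", "q2", "q3"),
                           sbjct = c("A", "B", "A"),
                           ident = c(100, 99.5, 100)),
  "file2.txt" = data.frame(query = c("q4", "q5"),
                           sbjct = c("B", "C"),
                           ident = c(99.8, 100)),
  # a data frame that lost all rows after the ident cutoff
  "file3.txt" = data.frame(query = character(0),
                           sbjct = character(0),
                           ident = numeric(0))
)

# One sbjct value per row, pooled over all data frames.
sbjct <- unlist(lapply(new.list, function(x) as.character(x$sbjct)),
                use.names = FALSE)

# Matching indicator: each list name repeated once per row.
# Fixing the factor levels keeps a column for data frames with no rows.
source.file <- factor(rep(names(new.list), vapply(new.list, nrow, integer(1))),
                      levels = names(new.list))

# Cross-tabulate; combinations that never occur are filled with 0.
ct <- table(sbjct = sbjct, file = source.file)

# Drop the "table" class to get a plain integer matrix.
count.matrix <- unclass(ct)
```

Here count.matrix has one row per distinct sbjct value and one column per list element; for the toy data, "A" is counted twice in file1.txt and the file3.txt column is all zeros.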

Upvotes: 1
