martin_m

Reputation: 137

From contingency tables to data.frame in R

My starting point is having several character vectors containing POS tags I extracted from texts. For example:

c("NNS", "VBP", "JJ",  "CC",  "DT")
c("NNS", "PRP", "JJ",  "RB",  "VB")

I use table() or ftable() to count the occurrences of each tag.

 CC  DT  JJ NNS VBP 
  1   1   1   1   1 

The ultimate goal is to have a data.frame looking like this:

   NNS VBP PRP JJ CC RB DT VB
1  1   1   0   1  1  0  1  0
2  1   0   1   1  0  1  0  1 

Using plyr::rbind.fill seems reasonable to me here, but it needs data.frame objects as inputs. However, as.data.frame.matrix(table(POS_vector)) fails with an error:

Error in seq_len(ncols) : 
argument must be coercible to non-negative integer

Using as.data.frame.matrix(ftable(POS_vector)) does produce a data.frame, but the tags are lost as column names:

V1 V2 V3 V4 V5 ...
1  1  1  1  1

Any help is highly appreciated.

Upvotes: 1

Views: 989

Answers (2)

A5C1D2H2I1M1N2O1R2T1

Reputation: 193517

In base R, you can try:

table(rev(stack(setNames(dat, seq_along(dat)))))
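Since the question asks for a data.frame rather than a table, here is a minimal sketch of one way to unpack the one-liner and convert its result. The intermediate names long, tab and res are only for illustration, and as.data.frame.matrix() should apply here because the table is two-dimensional (one row per vector, one column per tag):

```r
dat <- list(c("NNS", "VBP", "JJ", "CC", "DT"),
            c("NNS", "PRP", "JJ", "RB", "VB"))

# stack() returns a long data.frame with columns "values" (the tags)
# and "ind" (the list element each tag came from)
long <- stack(setNames(dat, seq_along(dat)))

# rev() swaps the column order so that "ind" indexes the rows
# and "values" indexes the columns of the contingency table
tab <- table(rev(long))

# a two-dimensional table can be converted to a data.frame directly
res <- as.data.frame.matrix(tab)
```

Tags that do not occur in a vector get a count of 0, so no extra filling step should be needed with this approach.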

You can also use mtabulate from "qdapTools":

library(qdapTools)
mtabulate(dat)
#   CC DT JJ NNS PRP RB VB VBP
# 1  1  1  1   1   0  0  0   1
# 2  0  0  1   1   1  1  1   0

dat is the same as defined in @Heroka's answer:

dat <- list(c("NNS", "VBP", "JJ",  "CC",  "DT"),
            c("NNS", "PRP", "JJ",  "RB",  "VB"))

Upvotes: 3

Heroka

Reputation: 13139

It's probably a bit of a workaround, but this might be a solution.

We assume all our vectors are in a list:

dat <- list(c("NNS", "VBP", "JJ",  "CC",  "DT"),
            c("NNS", "PRP", "JJ",  "RB",  "VB"))

Then we tabulate each vector, turn the table into a matrix, and transpose it so that the tags become the column names of a single row. That row is converted to a data.table:

library(data.table)
temp <- lapply(dat, function(x) {
  data.table(t(as.matrix(table(x))))
})
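To make the transpose step concrete, here is a small base R sketch for a single vector (the names counts, row and one_row are illustrative only). It also hints at why as.data.frame.matrix() likely failed in the question: table() on one vector returns a one-dimensional object, so there is no second dimension from which to read a column count.

```r
x <- c("NNS", "VBP", "JJ", "CC", "DT")

counts <- table(x)             # one-dimensional table of tag counts
n_dims <- length(dim(counts))  # 1, i.e. not a matrix

# as.matrix() gives a 5 x 1 matrix with the tags as row names;
# t() flips it into a 1 x 5 matrix with the tags as column names
row <- t(as.matrix(counts))

# a proper matrix converts cleanly and keeps its column names
one_row <- as.data.frame(row)
```

The same idea would feed plyr::rbind.fill from the question, since each vector then yields a one-row data.frame with named columns.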

Then we use rbindlist to create the desired output:

rbindlist(temp, fill = TRUE)
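One caveat: with fill = TRUE, tags missing from a vector come out as NA rather than the 0 shown in the question's desired output. A minimal sketch of one way to replace them follows, assuming a data.table version recent enough to provide setnafill() (the name res is illustrative):

```r
library(data.table)

dat <- list(c("NNS", "VBP", "JJ", "CC", "DT"),
            c("NNS", "PRP", "JJ", "RB", "VB"))

# one single-row data.table of tag counts per vector
temp <- lapply(dat, function(x) {
  data.table(t(as.matrix(table(x))))
})

# columns absent from a given row are filled with NA
res <- rbindlist(temp, fill = TRUE)

# replace the NAs with 0 by reference (modifies res in place)
setnafill(res, fill = 0L)
```

If setnafill() is not available, looping over the columns with set() is a possible alternative.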

We can also put all vectors into a single data.table first and aggregate afterwards. Note that this assumes the vectors all have the same length.

temp <- as.data.table(dat)
# reshape to long format
temp_m <- melt(temp, measure.vars = colnames(temp))

# count each variable/value combination, then reshape to wide
res <- dcast(temp_m[, .N, by = .(variable, value)], variable ~ value,
             value.var = "N", fill = 0)

Upvotes: 2
