Reputation: 5169
I have a data frame with Category
and pd
. I need to calculate how many times each meaningful word from all pd
is seen in each Category
. I'm stuck with the last step - summary. Ideally, ratio of that frequency to total length of pd
by Category
would be another X columns.
Example:
freq = structure(list(Category = c("C1", "C2"
), pd = c("96 oz, epsom salt 96 oz, epsom bath salt",
"17 x 24 in, bath mat")), .Names = c("Category", "pd"), row.names = c(NA,
-2L), class = "data.frame")
pool = sort(unique(gsub("[[:punct:]]|[0-9]","", unlist(strsplit(freq[,2]," ")))))
pool = pool[nchar(pool)>1]
freq
:
Category pd
1 C1 96 oz, epsom salt 96 oz, epsom bath salt
2 C2 17 x 24 in, bath mat
pool
:
[1] "bath" "epsom" "in" "mat" "oz" "salt"
Desired output:
pool C1freq C1ratio C2freq C2ratio
bath 1 1/7 1 1/3
epsom 2 2/7 0 0
in 0 0 1 1/3
mat 0 0 1 1/3
oz 2 2/7 0 0
salt 2 2/7 0 0
Where, e.g. 7
is the length of C1[,2]
with numbers, punctuation removed (as in pool
rules). 1/7
of course is not necessary in this form - here it is just to show the denominator length.
If possible, w/o dplyr
or qdap
. Thanks!!
Upvotes: 1
Views: 115
Reputation: 887501
We could try
library(qdapTools)
library(stringr)
lst <- str_extract_all(freq$pd, '[A-Za-z]{2,}')
m1 <- t(mtabulate(lst))
m2 <- prop.table(m1,2)
cbind(m1, m2)[,c(1,3,2,4)]
Or without the qdapTools
,
Un1 <- sort(unique(unlist(lst)))
m1 <- do.call(cbind, lapply(lst, function(x)
table(factor(x, levels=Un1))))
colnames(m1) <- freq$Category
cbind(m1, `colnames<-`(prop.table(m1,2), paste0(colnames(m1), 'Prop')))
Upvotes: 2
Reputation: 193637
You can consider adapting your current approach in the following manner:
tab <- table(
stack(
setNames(
lapply(strsplit(gsub("[[:punct:]]|[0-9]", "", freq$pd), "\\s+"),
function(x) x[nchar(x) > 1]), freq$Category)))
Notice here that I used gsub
first, rather than after splitting. Then, I split on a space, and filtered the data in the same manner that you had filtered it. Finally, I used setNames
so that I would be able to use stack
to get a long data.frame
that could be tabulated.
Once the data are tabulated, just use prop.table
to get your desired output.
cbind(tab, prop.table(tab, 2))
# C1 C2 C1 C2
# bath 1 1 0.1428571 0.3333333
# epsom 2 0 0.2857143 0.0000000
# in 0 1 0.0000000 0.3333333
# mat 0 1 0.0000000 0.3333333
# oz 2 0 0.2857143 0.0000000
# salt 2 0 0.2857143 0.0000000
Upvotes: 1