Reputation: 2127
I'm looking to add a column to a data frame (integrates2) that counts duplicates in sequence. Below is what the data looks like:
name program date of contact helper column
John ffp 10/11/2014 2
John TP 10/27/2014 2
Carlos TP 11/19/2015 3
Carlos ffp 12/1/2015 3
Carlos wfd 12/31/2015 3
Jen ffp 9/9/2014 2
Jen TP 9/30/2014 2
This is a list of people who've attended certain programs on certain dates. I've added a helper column to count duplicates and sorted the date of contact. I am looking to count the combinations of programs that exist (e.g. ffp-tp, tp-ffp-wfd).
In order to do this I want to implement the following code in order to transpose the ordered combinations with the help of a new column named "program2":
#transpose the programs
require(reshape2) dcast(integrates2, name ~ program2, value.var=”program”)
Then I plan to use the following code to turn the result into a table and data frame and count frequencies:
res = table(integrates2)
resdf = as.data.frame(res)
I saw this used in the following link: Count number of time combination of events appear in dataframe columns ext
What I need from "program2" is to look like this:
Name program date of contact helper column program2
John ffp 10/11/2014 2 1
John TP 10/27/2014 2 2
Carlos TP 11/19/2015 3 1
Carlos ffp 12/1/2015 3 2
Carlos wfd 12/31/2015 3 3
This way, I can use "program2" to transpose into different columns and then count the combinations. The final result should look something like this:
program pro1 pro2 freq
ffp tp 2
TP ffp wfd 1
I'm sure there are easier ways to do this, but as I am learning, this is where I am. Appreciate the help guys!
Upvotes: 4
Views: 827
Reputation: 23574
After thinking about this question, I think the following would be the way to go. If you do not mind combining all program names, you could do the following. This is probably much better.
setDT(mydf)[, list(type = paste(program, collapse = "-")), by = name][,
list(total = .N), by = type]
# type total
#1: ffp-TP 2
#2: TP-ffp-wfd 1
If you want to separate program names, you can do that with cSplit()
from the splitstackshape
package.
setDT(mydf)[, list(type = paste(program, collapse = "-")), by = name][,
list(total = .N), by = type] -> temp
cSplit(temp, splitCols = "type", sep = "-")
# total type_1 type_2 type_3
#1: 2 ffp TP NA
#2: 1 TP ffp wfd
The equivalence of dplyr code is:
group_by(mydf, name) %>%
summarise(type = paste(program, collapse = "-")) %>%
count(type)
# type n
# (chr) (int)
#1 ffp-TP 2
#2 TP-ffp-wfd 1
DATA
mydf <- structure(list(name = c("John", "John", "Carlos", "Carlos", "Carlos",
"Jen", "Jen"), program = c("ffp", "TP", "TP", "ffp", "wfd", "ffp",
"TP"), dateOfContact = c("10/11/2014", "10/27/2014", "11/19/2015",
"12/1/2015", "12/31/2015", "9/9/2014", "9/30/2014"), helperColumn = c(2L,
2L, 3L, 3L, 3L, 2L, 2L)), .Names = c("name", "program", "dateOfContact",
"helperColumn"), class = "data.frame", row.names = c(NA, -7L))
Upvotes: 2
Reputation: 43334
With dplyr
,
library(dplyr)
integrates2 %>% group_by(name) %>% summarise(prg1 = program[1],
prg2 = program[2],
prg3 = program[3]) %>%
select(prg1, prg2, prg3) %>% group_by(prg1, prg2, prg3) %>% summarise(freq = n())
returns
Source: local data frame [2 x 4]
Groups: prg1, prg2 [?]
prg1 prg2 prg3 freq
(fctr) (fctr) (fctr) (int)
1 ffp TP NA 2
2 TP ffp wfd 1
With mydf2
from the comments, it produces
Source: local data frame [3 x 4]
Groups: prg1, prg2 [?]
prg1 prg2 prg3 freq
(chr) (chr) (chr) (int)
1 ffp TP NA 1
2 TP ffp NA 1
3 wfd TP ffp 1
The chain
group_by
on name
to separates cases; summarise
to transpose program
into three columns; select
to narrow to those columns; group_by
all prg*
columns sosummarise
can chop them down to unique groups and add a freq
count of occurrences from those groups.Alternately, you can do the whole thing in base R if you like, though it's considerably less readable (with this particular approach, at least):
tab <- table(sapply(split(integrages2$program, integrates2$name),
function(x){paste(x, collapse = '-')}))
prgs <- strsplit(names(tab), '-')
programs <- do.call(rbind, lapply(prgs, function(x){
c(x, rep(NA, max(sapply(prgs, length)-length(x))))
}))
programs <- cbind(as.data.frame(programs), matrix(tab))
names(programs) <- c(paste0('prgm', seq(length(programs)-1)), 'freq')
A very quick-and-dirty version, which collapses the series into strings:
table(sapply(split(integrates2$program, integrates2$name),
function(x){paste(x, collapse = '-')}))
returns
ffp-TP TP-ffp-wfd
2 1
or if wrapped in as.matrix
,
[,1]
ffp-TP 2
TP-ffp-wfd 1
Using reshape2
, you can use dcast
to make a data.frame of program combinations (chopping out name
s, which we don't care about, with [,-1]
):
library(reshape2)
programs <- dcast(integrates2, name ~ program, value.var = 'program')[,-1]
programs
looks like:
> programs
ffp TP wfd
1 ffp TP wfd
2 ffp TP <NA>
3 ffp TP <NA>
You can now use dplyr
to group by all the column names of programs
(done programmatically here, but you can do it by hand with group_by(ffp, TP, wfd)
if you prefer to see what's going on) and summarise
, using n()
to grab counts of how many rows are in the groups:
library(dplyr)
programs %>% group_by_(.dots = names(programs)) %>% summarise(freq = n())
which returns
Source: local data frame [2 x 4]
Groups: ffp, TP [?]
ffp TP wfd freq
(chr) (chr) (chr) (int)
1 ffp TP wfd 1
2 ffp TP NA 2
Upvotes: 0