How to add a column that counts duplicates in sequence?

Question

I'm looking to add a column to a data frame (integrates2) that counts duplicates in sequence. Below is what the data looks like:

name    program  date of contact   helper column
John     ffp        10/11/2014          2
John     TP         10/27/2014          2
Carlos   TP         11/19/2015          3
Carlos   ffp        12/1/2015           3
Carlos   wfd        12/31/2015          3
Jen      ffp        9/9/2014            2
Jen      TP         9/30/2014           2

This is a list of people who've attended certain programs on certain dates. I've added a helper column to count duplicates and sorted the date of contact. I am looking to count the combinations of programs that exist (e.g. ffp-tp, tp-ffp-wfd).

To do this, I want to use the following code to transpose the ordered combinations with the help of a new column named "program2":

 #transpose the programs 
 library(reshape2)
 dcast(integrates2, name ~ program2, value.var = "program")

Then I plan to use the following code to turn the result into a table and data frame and count frequencies:

 res = table(integrates2)
 resdf = as.data.frame(res)

I saw this used in the following link: Count number of time combination of events appear in dataframe columns ext

What I need from "program2" is to look like this:

  Name    program  date of contact   helper column   program2
  John     ffp        10/11/2014          2             1
  John     TP         10/27/2014          2             2
  Carlos   TP         11/19/2015          3             1
  Carlos   ffp        12/1/2015           3             2
  Carlos   wfd        12/31/2015          3             3

This way, I can use "program2" to transpose into different columns and then count the combinations. The final result should look something like this:

    program  pro1   pro2   freq
    ffp      tp            2
    TP       ffp    wfd    1

I'm sure there are easier ways to do this, but as I am learning, this is where I am. Appreciate the help guys!

jazzurro · Accepted Answer

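After thinking about this question, I think the following is the way to go. If you do not mind combining all of a person's program names into a single string, you can collapse the programs for each name and then count how often each resulting string occurs. This is probably much simpler than transposing first.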

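library(data.table)

# Collapse each person's programs (in row order) into one string, then count each string
setDT(mydf)[, list(type = paste(program, collapse = "-")), by = name][,
           list(total = .N), by = type]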

#         type total
#1:     ffp-TP     2
#2: TP-ffp-wfd     1
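
Note that paste() joins the programs in whatever order the rows appear, so the result above relies on the data already being sorted by date within each name, as the question says it is. If that is not guaranteed, a minimal base R sketch along these lines would sort first. The names date, ord, seqs and freq are only illustrative, and the date format "%m/%d/%Y" is an assumption based on the sample data.

```r
# Sample data as in the question (helper column omitted)
mydf <- data.frame(
  name = c("John", "John", "Carlos", "Carlos", "Carlos", "Jen", "Jen"),
  program = c("ffp", "TP", "TP", "ffp", "wfd", "ffp", "TP"),
  dateOfContact = c("10/11/2014", "10/27/2014", "11/19/2015", "12/1/2015",
                    "12/31/2015", "9/9/2014", "9/30/2014"),
  stringsAsFactors = FALSE
)

# Parse the dates (assumed month/day/year) so they sort chronologically
mydf$date <- as.Date(mydf$dateOfContact, format = "%m/%d/%Y")

# Sort by name, then by date within name
ord <- mydf[order(mydf$name, mydf$date), ]

# One collapsed program sequence per person
seqs <- vapply(split(ord$program, ord$name), paste, character(1), collapse = "-")

# Count how often each sequence occurs
freq <- as.data.frame(table(type = unname(seqs)), stringsAsFactors = FALSE)
freq
```

Sorting on the parsed Date column rather than the raw string matters here, because "9/30/2014" would otherwise sort after "10/11/2014" as text.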

If you want to separate program names, you can do that with cSplit() from the splitstackshape package.

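library(splitstackshape)  # provides cSplit()

temp <- setDT(mydf)[, list(type = paste(program, collapse = "-")), by = name][,
              list(total = .N), by = type]

# Split the collapsed string back into one column per program
cSplit(temp, splitCols = "type", sep = "-")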

#   total type_1 type_2 type_3
#1:     2    ffp     TP     NA
#2:     1     TP    ffp    wfd

The equivalent dplyr code is:

library(dplyr)

group_by(mydf, name) %>%
  summarise(type = paste(program, collapse = "-")) %>%
  count(type)

#        type     n
#       (chr) (int)
#1     ffp-TP     2
#2 TP-ffp-wfd     1
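
If you still want the "program2" counter from the question (1, 2, 3, ... within each name), here is a rough base R sketch. It assumes the rows are already in date order within each name; the wide step uses base reshape() as a stand-in for the dcast() call in the question, and the object name wide is only illustrative.

```r
# Sample data as in the question (only the columns needed here)
mydf <- data.frame(
  name = c("John", "John", "Carlos", "Carlos", "Carlos", "Jen", "Jen"),
  program = c("ffp", "TP", "TP", "ffp", "wfd", "ffp", "TP"),
  stringsAsFactors = FALSE
)

# Running counter within each name: ave() applies seq_along() group by group
mydf$program2 <- ave(seq_along(mydf$name), mydf$name, FUN = seq_along)

# Spread the programs into one column per position (program.1, program.2, ...)
wide <- reshape(mydf, idvar = "name", timevar = "program2", direction = "wide")
wide
```

People with fewer programs get NA in the later columns, so John and Jen have NA in program.3.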

DATA

mydf <- structure(list(name = c("John", "John", "Carlos", "Carlos", "Carlos", 
"Jen", "Jen"), program = c("ffp", "TP", "TP", "ffp", "wfd", "ffp", 
"TP"), dateOfContact = c("10/11/2014", "10/27/2014", "11/19/2015", 
"12/1/2015", "12/31/2015", "9/9/2014", "9/30/2014"), helperColumn = c(2L, 
2L, 3L, 3L, 3L, 2L, 2L)), .Names = c("name", "program", "dateOfContact", 
"helperColumn"), class = "data.frame", row.names = c(NA, -7L))
