LoF10
LoF10

Reputation: 2127

How to add a column that counts duplicates in sequence?

I'm looking to add a column to a data frame (integrates2) that counts duplicates in sequence. Below is what the data looks like:

name    program  date of contact   helper column
John     ffp        10/11/2014          2
John     TP         10/27/2014          2
Carlos   TP         11/19/2015          3
Carlos   ffp        12/1/2015           3
Carlos   wfd        12/31/2015          3
Jen      ffp        9/9/2014            2
Jen      TP         9/30/2014           2    

This is a list of people who've attended certain programs on certain dates. I've added a helper column to count duplicates and sorted the date of contact. I am looking to count the combinations of programs that exist (e.g. ffp-tp, tp-ffp-wfd).

In order to do this I want to implement the following code in order to transpose the ordered combinations with the help of a new column named "program2":

 #transpose the programs 
 require(reshape2) dcast(integrates2, name ~ program2, value.var=”program”)

Then I plan to use the following code to turn the result into a table and data frame and count frequencies:

 res = table(integrates2)
 resdf = as.data.frame(res)

I saw this used in the following link: Count number of time combination of events appear in dataframe columns ext

What I need from "program2" is to look like this:

  Name    program  date of contact   helper column   program2
  John     ffp        10/11/2014          2             1
  John     TP         10/27/2014          2             2
  Carlos   TP         11/19/2015          3             1
  Carlos   ffp        12/1/2015           3             2
  Carlos   wfd        12/31/2015          3             3

This way, I can use "program2" to transpose into different columns and then count the combinations. The final result should look something like this:

    program  pro1   pro2   freq      
     ffp     tp             2   
     TP      ffp    wfd     1    

I'm sure there are easier ways to do this, but as I am learning, this is where I am. Appreciate the help guys!

Upvotes: 4

Views: 827

Answers (2)

jazzurro
jazzurro

Reputation: 23574

After thinking about this question, I think the following would be the way to go. If you do not mind combining all program names, you could do the following. This is probably much better.

setDT(mydf)[, list(type = paste(program, collapse = "-")), by = name][,
           list(total = .N), by = type]

#         type total
#1:     ffp-TP     2
#2: TP-ffp-wfd     1

If you want to separate program names, you can do that with cSplit() from the splitstackshape package.

setDT(mydf)[, list(type = paste(program, collapse = "-")), by = name][,
              list(total = .N), by = type] -> temp

cSplit(temp, splitCols = "type", sep = "-")

#   total type_1 type_2 type_3
#1:     2    ffp     TP     NA
#2:     1     TP    ffp    wfd

The equivalence of dplyr code is:

group_by(mydf, name) %>%
summarise(type = paste(program, collapse = "-")) %>%
count(type)

#        type     n
#       (chr) (int)
#1     ffp-TP     2
#2 TP-ffp-wfd     1

DATA

mydf <- structure(list(name = c("John", "John", "Carlos", "Carlos", "Carlos", 
"Jen", "Jen"), program = c("ffp", "TP", "TP", "ffp", "wfd", "ffp", 
"TP"), dateOfContact = c("10/11/2014", "10/27/2014", "11/19/2015", 
"12/1/2015", "12/31/2015", "9/9/2014", "9/30/2014"), helperColumn = c(2L, 
2L, 3L, 3L, 3L, 2L, 2L)), .Names = c("name", "program", "dateOfContact", 
"helperColumn"), class = "data.frame", row.names = c(NA, -7L))

Upvotes: 2

alistaire
alistaire

Reputation: 43334

Edit: Return permutations

With dplyr,

library(dplyr)
integrates2 %>% group_by(name) %>% summarise(prg1 = program[1],
                                             prg2 = program[2],
                                             prg3 = program[3]) %>% 
  select(prg1, prg2, prg3) %>% group_by(prg1, prg2, prg3) %>% summarise(freq = n())

returns

Source: local data frame [2 x 4]
Groups: prg1, prg2 [?]

    prg1   prg2   prg3  freq
  (fctr) (fctr) (fctr) (int)
1    ffp     TP     NA     2
2     TP    ffp    wfd     1

With mydf2 from the comments, it produces

Source: local data frame [3 x 4]
Groups: prg1, prg2 [?]

   prg1  prg2  prg3  freq
  (chr) (chr) (chr) (int)
1   ffp    TP    NA     1
2    TP   ffp    NA     1
3   wfd    TP   ffp     1

The chain

  • calls group_by on name to separates cases;
  • summarise to transpose program into three columns;
  • select to narrow to those columns;
  • group_by all prg* columns so
  • summarise can chop them down to unique groups and add a freq count of occurrences from those groups.

Alternately, you can do the whole thing in base R if you like, though it's considerably less readable (with this particular approach, at least):

tab <- table(sapply(split(integrages2$program, integrates2$name), 
             function(x){paste(x, collapse = '-')}))
prgs <- strsplit(names(tab), '-')
programs <- do.call(rbind, lapply(prgs, function(x){
  c(x, rep(NA, max(sapply(prgs, length)-length(x))))
  }))
programs <- cbind(as.data.frame(programs), matrix(tab))
names(programs) <- c(paste0('prgm', seq(length(programs)-1)), 'freq')

A very quick-and-dirty version, which collapses the series into strings:

table(sapply(split(integrates2$program, integrates2$name), 
             function(x){paste(x, collapse = '-')}))

returns

ffp-TP TP-ffp-wfd 
     2          1 

or if wrapped in as.matrix,

           [,1]
ffp-TP        2
TP-ffp-wfd    1

Pre-edit version: Return combinations

Using reshape2, you can use dcast to make a data.frame of program combinations (chopping out names, which we don't care about, with [,-1]):

library(reshape2)
programs <- dcast(integrates2, name ~ program, value.var = 'program')[,-1]

programs looks like:

> programs
  ffp TP  wfd
1 ffp TP  wfd
2 ffp TP <NA>
3 ffp TP <NA>

You can now use dplyr to group by all the column names of programs (done programmatically here, but you can do it by hand with group_by(ffp, TP, wfd) if you prefer to see what's going on) and summarise, using n() to grab counts of how many rows are in the groups:

library(dplyr)
programs %>% group_by_(.dots = names(programs)) %>% summarise(freq = n())

which returns

Source: local data frame [2 x 4]
Groups: ffp, TP [?]

    ffp    TP   wfd  freq
  (chr) (chr) (chr) (int)
1   ffp    TP   wfd     1
2   ffp    TP    NA     2

Upvotes: 0

Related Questions