Reputation: 309

R data frame group by a column and create repetition number of another column

I have a data frame like this:

ii <- data.frame(cid = c(rep('a',8),rep('b',5)),
                 Interaction = c(rep('VCS',3), c('SLS'), rep('TCU',2), rep('MFM',2), rep('SLS', 2), 'COMM', rep('MFM',2)),
                 stringsAsFactors = F
)

   cid Interaction
1    a         VCS
2    a         VCS
3    a         VCS
4    a         SLS
5    a         TCU
6    a         TCU
7    a         MFM
8    a         MFM
9    b         SLS
10   b         SLS
11   b        COMM
12   b         MFM
13   b         MFM

And I would like to first group by cid then create another column that shows repetition number of Interaction columns. The result should look like this:

   cid Interaction   replicate
1    a         VCS   1
2    a         VCS   2
3    a         VCS   3
4    a         SLS   1
5    a         TCU   1
6    a         TCU   2
7    a         MFM   1
8    a         MFM   2
9    b         SLS   1
10   b         SLS   2
11   b        COMM   1
12   b         MFM   1
13   b         MFM   2

Eventually I wanted to also reshape this to a wide format (couldn't do it with the current format cause I lose duplicates) that would resemble something like:

   cid         InteractionTuple   
1    a         VCS1;VCS2;VCS3;SLS1;TCU1;TCU2;MFM1;MFM2
2    b         SLS1;SLS2;COMM;MFM1;MFM2

to be able to run association rule mining techniques, that are currently supporting repeated items per transaction.

Upvotes: 2

Answers (3)

BENY

Reputation: 323226

This answer based on dplyr
1st Part

Q1=ii%>%group_by(cid,Interaction)%>%
       mutate(replicate=rank(Interaction,ties.method="first"))
Q1
     cid Interaction replicate
   <chr>       <chr>     <int>
1      a         VCS         1
2      a         VCS         2
3      a         VCS         3
4      a         SLS         1
5      a         TCU         1
6      a         TCU         2
7      a         MFM         1
8      a         MFM         2
9      b         SLS         1
10     b         SLS         2
11     b        COMM         1
12     b         MFM         1
13     b         MFM         2

2nd Part

Q2=Q1%>%group_by(cid)%>%
        summarise(InteractionTuple=paste0(Interaction,replicate,collapse = ";"))
Q2
    # A tibble: 2 × 2
        cid                        InteractionTuple
      <chr>                                   <chr>
    1     a VCS1;VCS2;VCS3;SLS1;TCU1;TCU2;MFM1;MFM2
    2     b               SLS1;SLS2;COMM1;MFM1;MFM2

Upvotes: 1

Kristofersen

Reputation: 2806

Here's a data.table solution

library(data.table)
setDT(dt)
dt[ , "replicate" := 1:.N, by = .(Interaction, cid)]

    cid Interaction replicate
 1:   a         VCS         1
 2:   a         VCS         2
 3:   a         VCS         3
 4:   a         SLS         1
 5:   a         TCU         1
 6:   a         TCU         2
 7:   a         MFM         1
 8:   a         MFM         2
 9:   b         SLS         1
10:   b         SLS         2
11:   b        COMM         1
12:   b         MFM         1
13:   b         MFM         2

Edit part2:

dt2 = dt[ , .("InteractionTuple" = paste(Interaction, replicate, sep = "", collapse = ";")), by = .(cid)]
> dt2
   cid                        InteractionTuple
1:   a VCS1;VCS2;VCS3;SLS1;TCU1;TCU2;MFM1;MFM2
2:   b               SLS1;SLS2;COMM1;MFM1;MFM2

Edit2

@MikeH suggested a different way which might be faster. Here are the results

microbenchmark(dt2 = dt[ , .("replicate" = 1:.N), by = .(Interaction, cid)],
               dt3 = dt[ , .("replicate" = seq_len(.N)), by = .(Interaction, cid)], times = 1000L)

Unit: microseconds
 expr     min      lq     mean   median       uq      max neval
  dt2 323.960 364.361 434.6370 402.8740 457.6220  2382.88  1000
  dt3 318.296 360.585 508.1313 397.3985 461.5865 42750.25  1000

The median is a little bit better using seq_len(.N).

Upvotes: 2

zx8754

Reputation: 56159

Using dplyr:

library(dplyr)

ii %>% 
  group_by(cid, Interaction) %>% 
  mutate(Interaction_rn = paste0(Interaction, row_number())) %>% 
  group_by(cid) %>% 
  summarise(InteractionTuple = paste(Interaction_rn, collapse = ";"))

# # A tibble: 2 x 2
#     cid                        InteractionTuple
#   <chr>                                   <chr>
# 1     a VCS1;VCS2;VCS3;SLS1;TCU1;TCU2;MFM1;MFM2
# 2     b               SLS1;SLS2;COMM1;MFM1;MFM2

Upvotes: 2

R data frame group by a column and create repetition number of another column

Answers (3)

Related Questions