Reputation: 309
I have a data frame like this:
ii <- data.frame(cid = c(rep('a',8),rep('b',5)),
Interaction = c(rep('VCS',3), c('SLS'), rep('TCU',2), rep('MFM',2), rep('SLS', 2), 'COMM', rep('MFM',2)),
stringsAsFactors = F
)
cid Interaction
1 a VCS
2 a VCS
3 a VCS
4 a SLS
5 a TCU
6 a TCU
7 a MFM
8 a MFM
9 b SLS
10 b SLS
11 b COMM
12 b MFM
13 b MFM
And I would like to first group by cid
then create another column that shows repetition number of Interaction
columns. The result should look like this:
cid Interaction replicate
1 a VCS 1
2 a VCS 2
3 a VCS 3
4 a SLS 1
5 a TCU 1
6 a TCU 2
7 a MFM 1
8 a MFM 2
9 b SLS 1
10 b SLS 2
11 b COMM 1
12 b MFM 1
13 b MFM 2
Eventually I wanted to also reshape this to a wide format (couldn't do it with the current format cause I lose duplicates) that would resemble something like:
cid InteractionTuple
1 a VCS1;VCS2;VCS3;SLS1;TCU1;TCU2;MFM1;MFM2
2 b SLS1;SLS2;COMM;MFM1;MFM2
to be able to run association rule mining techniques, that are currently supporting repeated items per transaction.
Upvotes: 2
Views: 1444
Reputation: 323226
This answer based on dplyr
1st Part
Q1=ii%>%group_by(cid,Interaction)%>%
mutate(replicate=rank(Interaction,ties.method="first"))
Q1
cid Interaction replicate
<chr> <chr> <int>
1 a VCS 1
2 a VCS 2
3 a VCS 3
4 a SLS 1
5 a TCU 1
6 a TCU 2
7 a MFM 1
8 a MFM 2
9 b SLS 1
10 b SLS 2
11 b COMM 1
12 b MFM 1
13 b MFM 2
2nd Part
Q2=Q1%>%group_by(cid)%>%
summarise(InteractionTuple=paste0(Interaction,replicate,collapse = ";"))
Q2
# A tibble: 2 × 2
cid InteractionTuple
<chr> <chr>
1 a VCS1;VCS2;VCS3;SLS1;TCU1;TCU2;MFM1;MFM2
2 b SLS1;SLS2;COMM1;MFM1;MFM2
Upvotes: 1
Reputation: 2806
Here's a data.table solution
library(data.table)
setDT(dt)
dt[ , "replicate" := 1:.N, by = .(Interaction, cid)]
cid Interaction replicate
1: a VCS 1
2: a VCS 2
3: a VCS 3
4: a SLS 1
5: a TCU 1
6: a TCU 2
7: a MFM 1
8: a MFM 2
9: b SLS 1
10: b SLS 2
11: b COMM 1
12: b MFM 1
13: b MFM 2
Edit part2:
dt2 = dt[ , .("InteractionTuple" = paste(Interaction, replicate, sep = "", collapse = ";")), by = .(cid)]
> dt2
cid InteractionTuple
1: a VCS1;VCS2;VCS3;SLS1;TCU1;TCU2;MFM1;MFM2
2: b SLS1;SLS2;COMM1;MFM1;MFM2
Edit2
@MikeH suggested a different way which might be faster. Here are the results
microbenchmark(dt2 = dt[ , .("replicate" = 1:.N), by = .(Interaction, cid)],
dt3 = dt[ , .("replicate" = seq_len(.N)), by = .(Interaction, cid)], times = 1000L)
Unit: microseconds
expr min lq mean median uq max neval
dt2 323.960 364.361 434.6370 402.8740 457.6220 2382.88 1000
dt3 318.296 360.585 508.1313 397.3985 461.5865 42750.25 1000
The median is a little bit better using seq_len(.N)
.
Upvotes: 2
Reputation: 56159
Using dplyr:
library(dplyr)
ii %>%
group_by(cid, Interaction) %>%
mutate(Interaction_rn = paste0(Interaction, row_number())) %>%
group_by(cid) %>%
summarise(InteractionTuple = paste(Interaction_rn, collapse = ";"))
# # A tibble: 2 x 2
# cid InteractionTuple
# <chr> <chr>
# 1 a VCS1;VCS2;VCS3;SLS1;TCU1;TCU2;MFM1;MFM2
# 2 b SLS1;SLS2;COMM1;MFM1;MFM2
Upvotes: 2