Reputation: 19
I have a dataframe in following format:
sample_df <- structure(list(conversationid = c("C1", "C2", "C2", "C2",
"C2", "C2", "C3", "C3", "C3", "C3"),
sentby = c("Consumer","Consumer", "Agent", "Agent", "Agent", "Consumer",
"Agent", "Consumer","Agent", "Agent"),
time = c("2018-04-25 03:54:04.550+0000", "2018-05-11 19:18:05.094+0000",
"2018-05-11 19:18:09.218+0000", "2018-05-11 19:18:09.467+0000",
"2018-05-11 19:18:13.527+0000", "2018-05-14 22:57:10.004+0000",
"2018-05-14 22:57:14.330+0000", "2018-05-14 22:57:20.795+0000",
"2018-05-14 22:57:22.168+0000", "2018-05-14 22:57:24.203+0000"),
diff = c(NA, NA, 0.0687333333333333, 0.00415, 0.0676666666666667, NA, 0.0721,
0.10775, 0.0228833333333333,0.0339166666666667)),
.Names = c("conversationid", "sentby","time","diff"), row.names = c(NA, 10L),
class = "data.frame")
Where conversationid is a conversation id and can contain messages sent by either an agent or a customer. What I would like to do is, maintain a running count whenever "Agent" shows up in the conversation, like this:
Target Output:
conversationid sentby diff agent_counter_flag
C1 Consumer NA 0
C2 Consumer NA 0
C2 Agent 0.06873333 1
C2 Agent 0.00415 2
C2 Agent 0.06766667 3
C2 Consumer NA 0
C3 Agent 0.0721 1
C3 Consumer 0.10775 0
C3 Agent 0.02288333 2
C3 Agent 0.03391667 3
Currently, I am able to partition the dataframe and rank all records grouped by cid using following code:
setDT(sample_df)
sample_df[,Order := rank(time, ties.method = "first"), by = "conversationid"]
sample_df <- as.data.frame(sample_df)
But all it does is rank records within a partition disregarding if it's an "Agent" or "Customer".
Current Output:
conversationid sentby diff Order
C1 Consumer NA 1
C2 Consumer NA 1
C2 Agent 0.06873333 2
C2 Agent 0.00415 3
C2 Agent 0.06766667 4
C2 Consumer NA 5
C3 Agent 0.0721 1
C3 Consumer 0.10775 2
C3 Agent 0.02288333 3
C3 Agent 0.03391667 4
How do I proceed so I can have my dataframe as shown in target output? Thanks!
Upvotes: 0
Views: 89
Reputation: 16881
Came across this post trying to solve a similar issue with dplyr
. You can sum logical values where you've tested for sentby == "Agent"
using dplyr
's grouping.
The long way, just to spell out what the logical column will look like:
library(dplyr)
sample_df %>%
mutate(is_agent = sentby == "Agent") %>%
group_by(conversationid) %>%
mutate(agent_counter_flag = ifelse(is_agent, cumsum(is_agent), 0)) %>%
ungroup()
#> # A tibble: 10 x 6
#> conversationid sentby time diff is_agent agent_counter_f…
#> <chr> <chr> <chr> <dbl> <lgl> <dbl>
#> 1 C1 Consum… 2018-04-25 03… NA FALSE 0
#> 2 C2 Consum… 2018-05-11 19… NA FALSE 0
#> 3 C2 Agent 2018-05-11 19… 0.0687 TRUE 1
#> 4 C2 Agent 2018-05-11 19… 0.00415 TRUE 2
#> 5 C2 Agent 2018-05-11 19… 0.0677 TRUE 3
#> 6 C2 Consum… 2018-05-14 22… NA FALSE 0
#> 7 C3 Agent 2018-05-14 22… 0.0721 TRUE 1
#> 8 C3 Consum… 2018-05-14 22… 0.108 FALSE 0
#> 9 C3 Agent 2018-05-14 22… 0.0229 TRUE 2
#> 10 C3 Agent 2018-05-14 22… 0.0339 TRUE 3
You'd probably want to follow that up with select(-is_agent)
to drop the logical column.
Or in practice, for an abbreviated form you can call cumsum
inside the mutate
.
sample_df %>%
group_by(conversationid) %>%
mutate(agent_counter_flag = ifelse(sentby == "Agent", cumsum(sentby == "Agent"), 0)) %>%
ungroup()
Either way, the idea is that within each conversationid
, you add the number of sentby == "Agent"
so far if it's sent by an agent, or just set to 0 if it's not sent by an agent.
Upvotes: 0
Reputation: 28705
library(data.table)
setDT(sample_df)
sample_df[, agent_counter_flag := {sba = (sentby == 'Agent'); sba*cumsum(sba)}
, by = conversationid]
sample_df
# conversationid sentby time diff agent_counter_flag
# 1: C1 Consumer 2018-04-25 03:54:04.550+0000 NA 0
# 2: C2 Consumer 2018-05-11 19:18:05.094+0000 NA 0
# 3: C2 Agent 2018-05-11 19:18:09.218+0000 0.06873333 1
# 4: C2 Agent 2018-05-11 19:18:09.467+0000 0.00415000 2
# 5: C2 Agent 2018-05-11 19:18:13.527+0000 0.06766667 3
# 6: C2 Consumer 2018-05-14 22:57:10.004+0000 NA 0
# 7: C3 Agent 2018-05-14 22:57:14.330+0000 0.07210000 1
# 8: C3 Consumer 2018-05-14 22:57:20.795+0000 0.10775000 0
# 9: C3 Agent 2018-05-14 22:57:22.168+0000 0.02288333 2
# 10: C3 Agent 2018-05-14 22:57:24.203+0000 0.03391667 3
As @Frank points out, this also works
sample_df[, agent_counter_flag := rowid(conversationid, sentby)*(sentby == "Agent")]
Benchmark
sample_df <- replicate(1000, sample_df, simplify = F) %>% rbindlist
microbenchmark(
rowidFrank = sample_df[, agent_counter_flag :=
rowid(conversationid, sentby)*(sentby == "Agent")]
, rowidUwe = sample_df[sentby == "Agent", agent_counter_flag := rowid(conversationid)]
, cumsum = sample_df[, agent_counter_flag := {sba = (sentby == 'Agent'); sba*cumsum(sba)}
, by = conversationid]
, unit = 'relative')
# Unit: relative
# expr min lq mean median uq max neval
# rowidFrank 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000 100
# rowidUwe 1.448858 1.438742 1.410849 1.414428 1.535292 0.5549433 100
# cumsum 1.322493 1.306228 1.316188 1.261325 1.308371 1.6431036 100
Upvotes: 2
Reputation: 42602
This is my data.table
solution which uses the rowid()
function and creates the new column agent_counter_flag
by reference:
library(data.table)
setDT(sample_df)
sample_df[sentby == "Agent", agent_counter_flag := rowid(conversationid)][]
conversationid sentby time diff agent_counter_flag 1: C1 Consumer 2018-04-25 03:54:04.550+0000 NA NA 2: C2 Consumer 2018-05-11 19:18:05.094+0000 NA NA 3: C2 Agent 2018-05-11 19:18:09.218+0000 0.06873333 1 4: C2 Agent 2018-05-11 19:18:09.467+0000 0.00415000 2 5: C2 Agent 2018-05-11 19:18:13.527+0000 0.06766667 3 6: C2 Consumer 2018-05-14 22:57:10.004+0000 NA NA 7: C3 Agent 2018-05-14 22:57:14.330+0000 0.07210000 1 8: C3 Consumer 2018-05-14 22:57:20.795+0000 0.10775000 NA 9: C3 Agent 2018-05-14 22:57:22.168+0000 0.02288333 2 10: C3 Agent 2018-05-14 22:57:24.203+0000 0.03391667 3
Upvotes: 1
Reputation: 1573
Here you are:
library(dplyr)
df <- data.frame(cid = c(rep("c1", 6), rep("C2", 4)),
Sent_by = c("Consumer", "Agent", "Consumer", "Consumer", "Agent", "Agent",
"Consumer", "Agent", "Agent", "Consumer"))
df %>% group_by(cid, Sent_by) %>%
mutate(agent_flag = ifelse(Sent_by == "Agent", 1:n(), NA),
consumer_flag = ifelse(Sent_by == "Consumer", 1:n(), NA))
Upvotes: 0