user2092493
user2092493

Reputation: 19

How to add incremental rank based on column value?

I have a dataframe in following format:

sample_df <- structure(list(conversationid = c("C1",  "C2", "C2",  "C2", 
"C2",  "C2", "C3",  "C3", "C3",  "C3"), 
sentby = c("Consumer","Consumer", "Agent", "Agent", "Agent", "Consumer", 
"Agent", "Consumer","Agent", "Agent"), 
time = c("2018-04-25 03:54:04.550+0000", "2018-05-11 19:18:05.094+0000", 
     "2018-05-11 19:18:09.218+0000", "2018-05-11 19:18:09.467+0000", 
     "2018-05-11 19:18:13.527+0000", "2018-05-14 22:57:10.004+0000", 
     "2018-05-14 22:57:14.330+0000", "2018-05-14 22:57:20.795+0000", 
     "2018-05-14 22:57:22.168+0000", "2018-05-14 22:57:24.203+0000"),
diff = c(NA, NA, 0.0687333333333333, 0.00415, 0.0676666666666667, NA, 0.0721, 
0.10775, 0.0228833333333333,0.0339166666666667)), 
.Names = c("conversationid", "sentby","time","diff"), row.names = c(NA, 10L), 
class = "data.frame")

Where conversationid is a conversation id and can contain messages sent by either an agent or a customer. What I would like to do is, maintain a running count whenever "Agent" shows up in the conversation, like this:

Target Output:

conversationid  sentby  diff    agent_counter_flag
        C1     Consumer NA          0
        C2     Consumer NA          0
        C2     Agent    0.06873333  1
        C2     Agent    0.00415     2
        C2     Agent    0.06766667  3
        C2     Consumer NA          0
        C3     Agent    0.0721      1
        C3     Consumer 0.10775     0
        C3     Agent    0.02288333  2
        C3     Agent    0.03391667  3

Currently, I am able to partition the dataframe and rank all records grouped by cid using following code:

setDT(sample_df)
sample_df[,Order := rank(time, ties.method = "first"), by = "conversationid"]
sample_df <- as.data.frame(sample_df)

But all it does is rank records within a partition disregarding if it's an "Agent" or "Customer".

Current Output:

   conversationid   sentby  diff    Order
        C1     Consumer NA          1
        C2     Consumer NA          1
        C2     Agent    0.06873333  2
        C2     Agent    0.00415     3
        C2     Agent    0.06766667  4
        C2     Consumer NA          5
        C3     Agent    0.0721      1
        C3     Consumer 0.10775     2
        C3     Agent    0.02288333  3
        C3     Agent    0.03391667  4

How do I proceed so I can have my dataframe as shown in target output? Thanks!

Upvotes: 0

Views: 89

Answers (4)

camille
camille

Reputation: 16881

Came across this post trying to solve a similar issue with dplyr. You can sum logical values where you've tested for sentby == "Agent" using dplyr's grouping.

The long way, just to spell out what the logical column will look like:

library(dplyr)

sample_df %>%
  mutate(is_agent = sentby == "Agent") %>%
  group_by(conversationid) %>%
  mutate(agent_counter_flag = ifelse(is_agent, cumsum(is_agent), 0)) %>%
  ungroup()
#> # A tibble: 10 x 6
#>    conversationid sentby  time               diff is_agent agent_counter_f…
#>    <chr>          <chr>   <chr>             <dbl> <lgl>               <dbl>
#>  1 C1             Consum… 2018-04-25 03… NA       FALSE                   0
#>  2 C2             Consum… 2018-05-11 19… NA       FALSE                   0
#>  3 C2             Agent   2018-05-11 19…  0.0687  TRUE                    1
#>  4 C2             Agent   2018-05-11 19…  0.00415 TRUE                    2
#>  5 C2             Agent   2018-05-11 19…  0.0677  TRUE                    3
#>  6 C2             Consum… 2018-05-14 22… NA       FALSE                   0
#>  7 C3             Agent   2018-05-14 22…  0.0721  TRUE                    1
#>  8 C3             Consum… 2018-05-14 22…  0.108   FALSE                   0
#>  9 C3             Agent   2018-05-14 22…  0.0229  TRUE                    2
#> 10 C3             Agent   2018-05-14 22…  0.0339  TRUE                    3

You'd probably want to follow that up with select(-is_agent) to drop the logical column.

Or in practice, for an abbreviated form you can call cumsum inside the mutate.

sample_df %>%
  group_by(conversationid) %>%
  mutate(agent_counter_flag = ifelse(sentby == "Agent", cumsum(sentby == "Agent"), 0)) %>%
  ungroup()

Either way, the idea is that within each conversationid, you add the number of sentby == "Agent" so far if it's sent by an agent, or just set to 0 if it's not sent by an agent.

Upvotes: 0

IceCreamToucan
IceCreamToucan

Reputation: 28705

library(data.table)
setDT(sample_df)

sample_df[, agent_counter_flag := {sba = (sentby == 'Agent'); sba*cumsum(sba)}
          , by = conversationid]
sample_df

#     conversationid   sentby                         time       diff agent_counter_flag
#  1:             C1 Consumer 2018-04-25 03:54:04.550+0000         NA                  0
#  2:             C2 Consumer 2018-05-11 19:18:05.094+0000         NA                  0
#  3:             C2    Agent 2018-05-11 19:18:09.218+0000 0.06873333                  1
#  4:             C2    Agent 2018-05-11 19:18:09.467+0000 0.00415000                  2
#  5:             C2    Agent 2018-05-11 19:18:13.527+0000 0.06766667                  3
#  6:             C2 Consumer 2018-05-14 22:57:10.004+0000         NA                  0
#  7:             C3    Agent 2018-05-14 22:57:14.330+0000 0.07210000                  1
#  8:             C3 Consumer 2018-05-14 22:57:20.795+0000 0.10775000                  0
#  9:             C3    Agent 2018-05-14 22:57:22.168+0000 0.02288333                  2
# 10:             C3    Agent 2018-05-14 22:57:24.203+0000 0.03391667                  3

As @Frank points out, this also works

sample_df[, agent_counter_flag := rowid(conversationid, sentby)*(sentby == "Agent")]

Benchmark

sample_df <- replicate(1000, sample_df, simplify = F) %>% rbindlist
microbenchmark(
  rowidFrank = sample_df[, agent_counter_flag := 
                           rowid(conversationid, sentby)*(sentby == "Agent")]
, rowidUwe = sample_df[sentby == "Agent", agent_counter_flag := rowid(conversationid)]
, cumsum   = sample_df[, agent_counter_flag := {sba = (sentby == 'Agent'); sba*cumsum(sba)}
                       , by = conversationid]
, unit = 'relative')

# Unit: relative
# expr            min       lq     mean   median       uq       max neval
# rowidFrank 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000   100
# rowidUwe   1.448858 1.438742 1.410849 1.414428 1.535292 0.5549433   100
# cumsum     1.322493 1.306228 1.316188 1.261325 1.308371 1.6431036   100

Upvotes: 2

Uwe
Uwe

Reputation: 42602

This is my data.table solution which uses the rowid() function and creates the new column agent_counter_flag by reference:

library(data.table)
setDT(sample_df)
sample_df[sentby == "Agent", agent_counter_flag := rowid(conversationid)][]
    conversationid   sentby                         time       diff agent_counter_flag
 1:             C1 Consumer 2018-04-25 03:54:04.550+0000         NA                 NA
 2:             C2 Consumer 2018-05-11 19:18:05.094+0000         NA                 NA
 3:             C2    Agent 2018-05-11 19:18:09.218+0000 0.06873333                  1
 4:             C2    Agent 2018-05-11 19:18:09.467+0000 0.00415000                  2
 5:             C2    Agent 2018-05-11 19:18:13.527+0000 0.06766667                  3
 6:             C2 Consumer 2018-05-14 22:57:10.004+0000         NA                 NA
 7:             C3    Agent 2018-05-14 22:57:14.330+0000 0.07210000                  1
 8:             C3 Consumer 2018-05-14 22:57:20.795+0000 0.10775000                 NA
 9:             C3    Agent 2018-05-14 22:57:22.168+0000 0.02288333                  2
10:             C3    Agent 2018-05-14 22:57:24.203+0000 0.03391667                  3

Upvotes: 1

Mislav
Mislav

Reputation: 1573

Here you are:

library(dplyr)

df <- data.frame(cid = c(rep("c1", 6), rep("C2", 4)),
                 Sent_by = c("Consumer", "Agent", "Consumer", "Consumer", "Agent", "Agent",
                             "Consumer", "Agent", "Agent", "Consumer"))
df %>% group_by(cid, Sent_by) %>%
  mutate(agent_flag = ifelse(Sent_by == "Agent", 1:n(), NA),
         consumer_flag = ifelse(Sent_by == "Consumer", 1:n(), NA))

Upvotes: 0

Related Questions