user13467695
user13467695

Reputation:

Count number of specific rows within a group

I have a dataframe:

ID       SUB_ID     Action  
1         NA         Preparate1 
1         NA         Preparate2 
1         A          Open
1         A          Download
1         A          Close
1         B          Open
1         B          Search
1         B          Download
1         B          Close
2         AA          Open
2         AA          Download
2         AA          Close
2         BB          Open
2         BB          Search
2         BB          Filter
2         BB          Close
3         C           Open
3         C           Search
3         C           Filter
3         C           Close

I want to get table with ID and number of SUB_ID per ID and number of "Download" in column Action. So, desired result is:

ID       SUB_ID_n     Download_n 
1         2            2
2         2            1
3         1            0

How could i do that? As you see there are sometimes rows with NA in SUB_ID. I don't want to count those NA as additional SUB_ID within ID.

I have tried this but its counting NA in SUB_ID:

setDT(df)[, .(SUB_ID_n = uniqueN(SUB_ID), 
              Download_n = sum(Action == 'Download')), ID]

Upvotes: 0

Views: 44

Answers (2)

akrun
akrun

Reputation: 887118

here, we need to use na.rm = TRUE as by default is FALSE

library(data.table)
setDT(df)[, .(SUB_ID_n = uniqueN(SUB_ID, na.rm = TRUE), 
               Download_n = sum(Action == 'Download')), ID]
#   ID SUB_ID_n Download_n
#1:  1        2          2
#2:  2        2          1
#3:  3        1          0

Similarly, the n_distinct from dplyr also have na.rm argument

library(dplyr)
df %>% 
    group_by(ID) %>% 
    summarise(SUB_ID_n = n_distinct(SUB_ID, na.rm = TRUE),
              Download_n = sum(Action == 'Download'), .groups = 'drop')
# A tibble: 3 x 3
#     ID SUB_ID_n Download_n
#  <int>    <int>      <int>
#1     1        2          2
#2     2        2          1
#3     3        1          0

data

df <-  structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L), SUB_ID = c(NA, NA, "A", 
"A", "A", "B", "B", "B", "B", "AA", "AA", "AA", "BB", "BB", "BB", 
"BB", "C", "C", "C", "C"), Action = c("Preparate1", "Preparate2", 
"Open", "Download", "Close", "Open", "Search", "Download", "Close", 
"Open", "Download", "Close", "Open", "Search", "Filter", "Close", 
"Open", "Search", "Filter", "Close")), class = "data.frame", row.names = c(NA, 
-20L))

Upvotes: 0

Duck
Duck

Reputation: 39595

Try this tidyverse approach. The key is to subset non NA values in summarise(). Here the code:

library(tidyverse)
#Data and code
df %>% group_by(ID) %>%
  summarise(N_id=n_distinct(SUB_ID[!is.na(SUB_ID)]),
            N_Action=length(Action[Action=='Download']))

Output:

# A tibble: 3 x 3
     ID  N_id N_Action
  <int> <int>    <int>
1     1     2        2
2     2     2        1
3     3     1        0

Some data used:

#Data
df <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L), SUB_ID = c(NA, NA, "A", 
"A", "A", "B", "B", "B", "B", "AA", "AA", "AA", "BB", "BB", "BB", 
"BB", "C", "C", "C", "C"), Action = c("Preparate1", "Preparate2", 
"Open", "Download", "Close", "Open", "Search", "Download", "Close", 
"Open", "Download", "Close", "Open", "Search", "Filter", "Close", 
"Open", "Search", "Filter", "Close")), class = "data.frame", row.names = c(NA, 
-20L))

Upvotes: 1

Related Questions