Sophia Wilson
Sophia Wilson

Reputation: 477

How to get exclusive and total count from multiple ids in R

I have below-mentioned dataframe in R.

CusID     Date                   Type        LogCat             PriceCode
C-1       2021-01-02 14:13:10    Demo                           
C-2       2021-01-02 13:17:07    Pro         SPR, DET, RTD      KR
C-2       2021-01-02 14:15:10    Pro         SPR, DET, RTD      SR
C-2       2021-01-02 16:14:08    Pro         SPR, DET, RTD      DE
C-3       2021-01-02 17:11:03    Pro         DET                KL
C-3       2021-01-02 12:14:24    Pro         DET                ZT
C-3       2021-01-02 12:33:34    Pro         DET                KR
C-4       2021-01-02 10:43:55    Pro                            KR
C-5       2021-01-03 20:23:35    Pro         SPR, RTD           KR
  

Dupt<-

structure(list(CustID = c("C-1", "C-2", 
"C-2", "C-2", "C-3", "C-3", 
"C-3", "C-4", "C-5"), DATE = c("2021-01-02 14:13:10", "2021-01-02 13:17:07", "2021-01-02 14:15:10", "2021-01-02 16:14:08", "2021-01-02 17:11:03", "2021-01-02 12:14:24", "2021-01-02 12:33:34", "2021-01-02 10:43:55", "2021-01-03 20:23:35"), TYPE = c("Demo", 
"Pro", "Pro", "Pro", "Pro", "Pro", "Pro", "Pro", "Pro"
), LogCat = c(NA, "SPR,DET,RTD", "SPR,DET,RTD", "SPR,DET,RTD", 
"DET", "DET", "DET", NA, " SPR, RTD "), PriceCode = c(NA,"KR", "SR", "DE", "KL", "ZT", "KR", "KR", "KR")), class = "data.frame", row.names = c(NA, 
-9L))

By using the above dataframe, I need to create following two different output dataframe based on Date and Month. For Example, let take Date 2021-01-02.

Required Dataframe1<-

Total         4        100.00%
Demo          1        25.00%
Pro           3        75.00%

For Required Dataframe 2 we need to consider only those CusID which has Type not equal to Demo. We need to get the following count for LogCat and PriceCode for Such CustId. The thing to note here is the value in LogCat and PriceCode are not standard and can change from date to date so that we can't make it hard code.

The requirement here is to count the value mentioned in LogCate and PriceCode uniquely for each uniuqe CustID.

For Example, the DET text is uniquely available in CustID C-3 without any other comma separated value and similarly KR text is uniquely available in CustID C-4 without having any other value in ProceCode column for that particular (C-4) CustID.

The (All) concatenated with each unique LogCat and PriceCode value implies that it has present with combination with other value in that particular column.

The last NA column is static for only in case if there are some CustID having no LogCat and PriceCode value but still has Type as Pro.

Required Dataframe2<-

LogCat            Count        %          PriceCode       Count        %
SPR               0            0.00%      KR              1            50.00%
SPR (All)         1            50.00%     KR (All)        3            100.00%
DET               1            50.00%     SR              0            0.00%
DET (All)         2            100.00%    SR (All)        1            33.33%
RTD               0            0.00%      DE              0            0.00%
RTD (All)         1            50.00%     DE (All)        1            33.33%
Blank             1            33.33%     ZT              0            0.00%
-                 -            -          ZT (All)        1            33.33%
-                 -            -          KL              0            0.00%
-                 -            -          KL (All)        1            33.33%
-                 -            -          Blank           0            0.00%
Both Blank        0            0.00%      -               0            0.00%

Upvotes: 4

Views: 546

Answers (2)

TimTeaFan
TimTeaFan

Reputation: 18541

I added some comments and explanation to my answer. I think this approach should work, but a few things are not clear to me:

  1. You want a summary for each month, but your toy data only contains dates in January 2021 (I changed one entry to at least have two month to work with).

  2. How should the final output format look like? You describe how one data.frame for one month should look like, but how should the final ouput for all month look like? I opted for a nested tibble where each month has its own summary data.frame.

  3. I assume that the variable ending with "(All)" contains the count of all occurrences. The variables not ending with "(All)" contain the count for the case the category is the only one mentioned. Is this correct?

  4. I do not understand how the final frequencies, especially for LogCat are calculated. I come to the same numbers as your example output for PriceCode when dividing the count of each category by the total number of CustIDs. But when looking at your frequencies of LogCat it seems like DET (All) is 100% (although we have 3 CustIDs) suggesting that all (All) variables show a relative frequency which should always be 100%. DET has count 1 and a frequency of 50%. So in this case it seems to be the relative count. But then RTD (All) has a count of 1 and a frequency of 50%. This frequency is relative to what? The total number if CustIDs on that day for TYPE != "Demo" is 3.

Apart from this, the code below, although somewhat longish, should work, and should be easily adjustable to other definitions of frequency.

Setup:

# The setup contains the initial data slightly adjusted so that it contains
# two different month
# 
# we use the `tidyverse` library and some custom helper functions

# initial data: slightly changed so that it contains two different month
dat <- structure(
  list(
    CustID = c("C-1", "C-2",
               "C-2", "C-2", "C-3", "C-3",
               "C-3", "C-4", "C-5"),
    DATE = c(
      "2021-01-02 14:13:10",
      "2021-01-02 13:17:07",
      "2021-01-02 14:15:10",
      "2021-01-02 16:14:08",
      "2021-01-02 17:11:03",
      "2021-01-02 12:14:24",
      "2021-01-02 12:33:34",
      "2021-01-02 10:43:55",
      "2021-02-01 20:23:35" # I changed this to february to have two different month
    ),
    TYPE = c("Demo",
             "Pro", "Pro", "Pro", "Pro", "Pro", "Pro", "Pro", "Pro"),
    LogCat = c(
      NA,
      "SPR,DET,RTD",
      "SPR,DET,RTD",
      "SPR,DET,RTD",
      "DET",
      "DET",
      "DET",
      NA,
      " SPR, RTD "
    ),
    PriceCode = c(NA, "KR", "SR", "DE", "KL", "ZT", "KR", "KR", "KR")
  ),
  class = "data.frame",
  row.names = c(NA,-9L)
)


# helper functions
# some additional helper functions to get the desired output format:

# adds `length` nrows of NA's to a data.frame
extend_df <- function(df, length) {
  fill_df <- df[seq_len(length), ] 
  fill_df[,] <- NA
  rbind(df, fill_df)
} 

# cbind (column bind) two data.frames with different number of rows
# by appending NA with `extend_df()`
async_cbind <- function(df1, df2) {
  df1_r <- nrow(df1)
  df2_r <- nrow(df2)
  
  if (df1_r < df2_r) {
    df1 <- extend_df(df1, df2_r - df1_r)
    return(cbind(df1, df2))
  } else if (df1_r > df2_r) {
    df2 <- extend_df(df2, df1_r - df2_r)
    return(cbind(df1, df2))
  } else {
    cbind(df1, df2)
  }
}

# split a data.frame by `cbind`ing each group
split_wide <- function(df, g) {
  v <- as.character(substitute(g))
  df_ls <- split(df[, -which(names(df) == v)], df[[v]])
  out <- Reduce(async_cbind, df_ls)
  names(out) <- make.names(names(out), unique = TRUE)
  out
}

library(tidyverse)

To get the data.frame of the first part:

res1 <- dat %>% 
  # convert `TYPE` to `factor`
  mutate(TYPE = as.factor(TYPE)) %>% 
  # `nest_by()` year and month with `substr` which gives us a nested `tibble`
  # with one `tibble` per month
  nest_by(month = paste(substr(DATE, 1, 4), substr(DATE, 6,7), sep = "-")) %>%
  # lets take each nested `tibble` and nest it again by appending the same data
  # but with `TYPE` set as `"Total"`
  mutate(data = list(
    tibble(dat = list(mutate(data, TYPE = "Total"), data)) %>% 
      # we continue `rowsie` and summarise the inner nested `tibbles` by `TYPE`
      rowwise() %>% 
      mutate(dat = list(dat %>% 
                          group_by(TYPE, .drop = FALSE) %>% 
                          summarise(n = n_distinct(CustID)) %>% 
                          # let's add the frequency counts
                          mutate(freq = prop.table(n)))) %>% 
      # and unnest
      unnest(dat)))

This gives us the following nested tibble ...

res1
#> # A tibble: 2 x 2
#> # Rowwise:  month
#>   month   data            
#>   <chr>   <list>          
#> 1 2021-01 <tibble [3 × 3]>
#> 2 2021-02 <tibble [3 × 3]>

... where the data column contains one data.frame for each month:

res1 %>% pull(data)
#> [[1]]
#> # A tibble: 3 x 3
#>   TYPE      n  freq
#>   <chr> <int> <dbl>
#> 1 Total     4  1   
#> 2 Demo      1  0.25
#> 3 Pro       3  0.75
#> 
#> [[2]]
#> # A tibble: 3 x 3
#>   TYPE      n  freq
#>   <chr> <int> <dbl>
#> 1 Total     1     1
#> 2 Demo      0     0
#> 3 Pro       1     1

To get the data.frame of the second part:

It is preferable to have the data in a different format, so that PriceCode contains multiple comma separated codes similar to LogCat.

res2a <- dat %>% 
  # filter out all rows with `TYPE == "Demo"` 
  filter(TYPE != "Demo") %>%
  # add month variable containing year and month as character string.
  mutate(month = paste(substr(DATE, 1, 4), substr(DATE,6,7), sep = "-")) %>% 
  # group by `CustID` and `month`
  group_by(CustID, month) %>% 
  # collapse `PriceCode` so that format is similar to `LogCat`
  mutate(PriceCode = paste(PriceCode, collapse= ",")) %>% 
  # ungroup and drop `TYPE` and `DATE`
  ungroup() %>% 
  select(!c(TYPE, DATE))

Lets have a look at this first step:

res2a
#> # A tibble: 8 x 4
#>   CustID LogCat        PriceCode month  
#>   <chr>  <chr>         <chr>     <chr>  
#> 1 C-2    "SPR,DET,RTD" KR,SR,DE  2021-01
#> 2 C-2    "SPR,DET,RTD" KR,SR,DE  2021-01
#> 3 C-2    "SPR,DET,RTD" KR,SR,DE  2021-01
#> 4 C-3    "DET"         KL,ZT,KR  2021-01
#> 5 C-3    "DET"         KL,ZT,KR  2021-01
#> 6 C-3    "DET"         KL,ZT,KR  2021-01
#> 7 C-4     <NA>         KR        2021-01
#> 8 C-5    " SPR, RTD "  KR        2021-02

In the next step we create all the columns we need for the final output table:

# First, we need to create named vectors to loop over both columns `LogCat` and `PriceCode`
# alternatively we could use dplyover::crossover() (a package I maintain on GitHub)
logcat_cols <-
  dat$LogCat[!is.na(dat$LogCat)] %>%
  strsplit(",") %>% 
  unlist %>%
  trimws %>%
  unique %>%
  set_names(., paste0("LogCat__", ., "(All)"))

pricecode_cols <-
  dat$PriceCode[!is.na(dat$PriceCode)] %>%
  strsplit(",") %>% 
  unlist %>%
  trimws %>%
  unique %>%
  set_names(., paste0("PriceCode__", ., "(All)"))

res2b <- res2a %>% 
  mutate(
         # we need to keep month and CustID 
         month = month,
         CustID = CustID,
         # this gives us one dummy column for each LogCat
         map_dfc(logcat_cols, ~ as.integer(grepl(.x, LogCat))),
         # this gives us one dummy column for each PriceCode
         map_dfc(pricecode_cols, ~ as.integer(grepl(.x, PriceCode))),
         # Lets create another set of dummy columns checking if the category was the only one mentioned
         # we do this of LogCat ...
         across(starts_with("LogCat") & ends_with("(All)"),
                list("d" = ~ if_else(rowSums(select(cur_data(), starts_with("LogCat") & ends_with("(All)"))) == .x & .x == 1, 1, 0))),
         # ... and for PriceCode, and for now lets append a `_d` as suffix to the variable name
         across(starts_with("PriceCode") & ends_with("(All)"),
                list("d" = ~ if_else(rowSums(select(cur_data(), starts_with("PriceCode") & ends_with("(All)"))) == .x & .x == 1, 1, 0))),
         # Lets add some final columns for when LogCat PriceCode are NA
         across(c(LogCat, PriceCode),
                ~ if_else(is.na(.x), 1, 0),
                .names = "{col}__Blank"),
         # lets drop the original columns, we don't need them anymore
         .keep = "none") %>% 
  # we still need to delete the suffix `_d` from some columns
  rename_with(~ gsub("\\(All\\)_d$", "", .x), ends_with("_d"))

Lets have a glimpse at the data with all needed columns:

res2b %>% glimpse
#> Rows: 8
#> Columns: 20
#> $ month                <chr> "2021-01", "2021-01", "2021-01", "2021-01", "2021…
#> $ CustID               <chr> "C-2", "C-2", "C-2", "C-3", "C-3", "C-3", "C-4", …
#> $ `LogCat__SPR(All)`   <int> 1, 1, 1, 0, 0, 0, 0, 1
#> $ `LogCat__DET(All)`   <int> 1, 1, 1, 1, 1, 1, 0, 0
#> $ `LogCat__RTD(All)`   <int> 1, 1, 1, 0, 0, 0, 0, 1
#> $ `PriceCode__KR(All)` <int> 1, 1, 1, 1, 1, 1, 1, 1
#> $ `PriceCode__SR(All)` <int> 1, 1, 1, 0, 0, 0, 0, 0
#> $ `PriceCode__DE(All)` <int> 1, 1, 1, 0, 0, 0, 0, 0
#> $ `PriceCode__KL(All)` <int> 0, 0, 0, 1, 1, 1, 0, 0
#> $ `PriceCode__ZT(All)` <int> 0, 0, 0, 1, 1, 1, 0, 0
#> $ LogCat__SPR          <dbl> 0, 0, 0, 0, 0, 0, 0, 0
#> $ LogCat__DET          <dbl> 0, 0, 0, 1, 1, 1, 0, 0
#> $ LogCat__RTD          <dbl> 0, 0, 0, 0, 0, 0, 0, 0
#> $ PriceCode__KR        <dbl> 0, 0, 0, 0, 0, 0, 1, 1
#> $ PriceCode__SR        <dbl> 0, 0, 0, 0, 0, 0, 0, 0
#> $ PriceCode__DE        <dbl> 0, 0, 0, 0, 0, 0, 0, 0
#> $ PriceCode__KL        <dbl> 0, 0, 0, 0, 0, 0, 0, 0
#> $ PriceCode__ZT        <dbl> 0, 0, 0, 0, 0, 0, 0, 0
#> $ LogCat__Blank        <dbl> 0, 0, 0, 0, 0, 0, 1, 0
#> $ PriceCode__Blank     <dbl> 0, 0, 0, 0, 0, 0, 0, 0

After we have created the data that we want to work with we have to do two things: (1) summarise the data by month and (2) bring it into the correct output format

res2c <- res2b %>% 
  # lets nest by month
  nest_by(month) %>%
  mutate(
    # lets add a count of how many CustIDs each month contains
    total_count = n_distinct(data$CustID),
    data = list(
        data %>%
          # we only keep distinct rows
              distinct(.keep_all = TRUE) %>%
          # bring the data in long format ...
              pivot_longer(!CustID,
                           names_to = c("var", "key"),
                           names_sep = "__") %>% 
          # ... group by var and key created in the pivot_longer call
              group_by(var, key) %>% 
          # and sum the values to get the final `count` variable
              summarise(count = sum(value, na.rm = TRUE),
                        .groups = "drop") %>%
          # NOTE SURE: is the frequency the count of each category divided by the total count?
              mutate(freq = .data$count / .env$total_count) %>% 
          # lets sort "Blank" at the bottom
              arrange(key == "Blank") %>% 
          # final we use our custom function to split the data.frame in two parts
              split_wide(var) %>% 
          # and last but not least, we need one more row:
              add_row(key = "Both Blank",
                      count = sum(subset(., key == "Blank")$count + 
                                    subset(., key.1 == "Blank")$count,
                                  na.rm = TRUE))
         )) 

Our result is a nested tibble:

res2c
#> # A tibble: 2 x 3
#> # Rowwise:  month
#>   month   data              total_count
#>   <chr>   <list>                  <int>
#> 1 2021-01 <df[,6] [12 × 6]>           3
#> 2 2021-02 <df[,6] [12 × 6]>           1

Each month has its own data.frame in the desired output format:

res2c %>% pull(data)
#> [[1]]
#>           key count      freq   key.1 count.1    freq.1
#> 1         DET     1 0.3333333      DE       0 0.0000000
#> 2    DET(All)     2 0.6666667 DE(All)       1 0.3333333
#> 3         RTD     0 0.0000000      KL       0 0.0000000
#> 4    RTD(All)     1 0.3333333 KL(All)       1 0.3333333
#> 5         SPR     0 0.0000000      KR       1 0.3333333
#> 6    SPR(All)     1 0.3333333 KR(All)       3 1.0000000
#> 7       Blank     1 0.3333333      SR       0 0.0000000
#> 8        <NA>    NA        NA SR(All)       1 0.3333333
#> 9        <NA>    NA        NA      ZT       0 0.0000000
#> 10       <NA>    NA        NA ZT(All)       1 0.3333333
#> 11       <NA>    NA        NA   Blank       0 0.0000000
#> 12 Both Blank     0        NA    <NA>      NA        NA
#> 
#> [[2]]
#>           key count freq   key.1 count.1 freq.1
#> 1         DET     0    0      DE       0      0
#> 2    DET(All)     0    0 DE(All)       0      0
#> 3         RTD     0    0      KL       0      0
#> 4    RTD(All)     1    1 KL(All)       0      0
#> 5         SPR     0    0      KR       1      1
#> 6    SPR(All)     1    1 KR(All)       1      1
#> 7       Blank     0    0      SR       0      0
#> 8        <NA>    NA   NA SR(All)       0      0
#> 9        <NA>    NA   NA      ZT       0      0
#> 10       <NA>    NA   NA ZT(All)       0      0
#> 11       <NA>    NA   NA   Blank       0      0
#> 12 Both Blank     0   NA    <NA>      NA     NA

Created on 2021-04-21 by the reprex package (v0.3.0)

Upvotes: 1

AnilGoyal
AnilGoyal

Reputation: 26218

For the first part library janitor will be helpful (even for second part)

first part

library(tidyverse)
library(janitor)
df %>% mutate(DATE = as.Date(DATE)) %>% select(1:3) %>%
  unique() %>%
  tabyl(TYPE, DATE) %>%
  adorn_totals("row") %>%
  adorn_percentages("col") %>%
  adorn_pct_formatting(2) %>%
  adorn_ns("front")

  TYPE  2021-01-02  2021-01-03
  Demo 1  (25.00%) 0   (0.00%)
   Pro 3  (75.00%) 1 (100.00%)
 Total 4 (100.00%) 1 (100.00%)

Upvotes: 1

Related Questions