Reputation: 477
I have below-mentioned dataframe in R.
CusID Date Type LogCat PriceCode
C-1 2021-01-02 14:13:10 Demo
C-2 2021-01-02 13:17:07 Pro SPR, DET, RTD KR
C-2 2021-01-02 14:15:10 Pro SPR, DET, RTD SR
C-2 2021-01-02 16:14:08 Pro SPR, DET, RTD DE
C-3 2021-01-02 17:11:03 Pro DET KL
C-3 2021-01-02 12:14:24 Pro DET ZT
C-3 2021-01-02 12:33:34 Pro DET KR
C-4 2021-01-02 10:43:55 Pro KR
C-5 2021-01-03 20:23:35 Pro SPR, RTD KR
Dupt<-
structure(list(CustID = c("C-1", "C-2",
"C-2", "C-2", "C-3", "C-3",
"C-3", "C-4", "C-5"), DATE = c("2021-01-02 14:13:10", "2021-01-02 13:17:07", "2021-01-02 14:15:10", "2021-01-02 16:14:08", "2021-01-02 17:11:03", "2021-01-02 12:14:24", "2021-01-02 12:33:34", "2021-01-02 10:43:55", "2021-01-03 20:23:35"), TYPE = c("Demo",
"Pro", "Pro", "Pro", "Pro", "Pro", "Pro", "Pro", "Pro"
), LogCat = c(NA, "SPR,DET,RTD", "SPR,DET,RTD", "SPR,DET,RTD",
"DET", "DET", "DET", NA, " SPR, RTD "), PriceCode = c(NA,"KR", "SR", "DE", "KL", "ZT", "KR", "KR", "KR")), class = "data.frame", row.names = c(NA,
-9L))
By using the above dataframe, I need to create following two different output dataframe based on Date and Month. For Example, let take Date 2021-01-02.
Required Dataframe1<-
Total 4 100.00%
Demo 1 25.00%
Pro 3 75.00%
For Required Dataframe 2 we need to consider only those CusID
which has Type
not equal to Demo
. We need to get the following count for LogCat
and PriceCode
for Such CustId
. The thing to note here is the value in LogCat
and PriceCode
are not standard and can change from date to date so that we can't make it hard code.
The requirement here is to count the value mentioned in LogCate
and PriceCode
uniquely for each uniuqe CustID
.
For Example, the DET
text is uniquely available in CustID
C-3 without any other comma separated value and similarly KR
text is uniquely available in CustID
C-4 without having any other value in ProceCode
column for that particular (C-4
) CustID
.
The (All)
concatenated with each unique LogCat
and PriceCode
value implies that it has present with combination with other value in that particular column.
The last NA
column is static for only in case if there are some CustID
having no LogCat
and PriceCode
value but still has Type
as Pro.
Required Dataframe2<-
LogCat Count % PriceCode Count %
SPR 0 0.00% KR 1 50.00%
SPR (All) 1 50.00% KR (All) 3 100.00%
DET 1 50.00% SR 0 0.00%
DET (All) 2 100.00% SR (All) 1 33.33%
RTD 0 0.00% DE 0 0.00%
RTD (All) 1 50.00% DE (All) 1 33.33%
Blank 1 33.33% ZT 0 0.00%
- - - ZT (All) 1 33.33%
- - - KL 0 0.00%
- - - KL (All) 1 33.33%
- - - Blank 0 0.00%
Both Blank 0 0.00% - 0 0.00%
Upvotes: 4
Views: 546
Reputation: 18541
I added some comments and explanation to my answer. I think this approach should work, but a few things are not clear to me:
You want a summary for each month, but your toy data only contains dates in January 2021 (I changed one entry to at least have two month to work with).
How should the final output format look like? You describe how one data.frame
for one month should look like, but how should the final ouput for all month look like? I opted for a nested tibble
where each month has its own summary data.frame
.
I assume that the variable ending with "(All)"
contains the count of all occurrences. The variables not ending with "(All)"
contain the count for the case the category is the only one mentioned. Is this correct?
I do not understand how the final frequencies, especially for LogCat
are calculated. I come to the same numbers as your example output for PriceCode
when dividing the count of each category by the total number of CustID
s. But when looking at your frequencies of LogCat
it seems like DET (All)
is 100%
(although we have 3 CustID
s) suggesting that all (All)
variables show a relative frequency which should always be 100%
. DET
has count 1
and a frequency of 50%
. So in this case it seems to be the relative count. But then RTD (All)
has a count of 1
and a frequency of 50%
. This frequency is relative to what? The total number if CustID
s on that day for TYPE != "Demo"
is 3
.
Apart from this, the code below, although somewhat longish, should work, and should be easily adjustable to other definitions of frequency.
Setup:
# The setup contains the initial data slightly adjusted so that it contains
# two different month
#
# we use the `tidyverse` library and some custom helper functions
# initial data: slightly changed so that it contains two different month
dat <- structure(
list(
CustID = c("C-1", "C-2",
"C-2", "C-2", "C-3", "C-3",
"C-3", "C-4", "C-5"),
DATE = c(
"2021-01-02 14:13:10",
"2021-01-02 13:17:07",
"2021-01-02 14:15:10",
"2021-01-02 16:14:08",
"2021-01-02 17:11:03",
"2021-01-02 12:14:24",
"2021-01-02 12:33:34",
"2021-01-02 10:43:55",
"2021-02-01 20:23:35" # I changed this to february to have two different month
),
TYPE = c("Demo",
"Pro", "Pro", "Pro", "Pro", "Pro", "Pro", "Pro", "Pro"),
LogCat = c(
NA,
"SPR,DET,RTD",
"SPR,DET,RTD",
"SPR,DET,RTD",
"DET",
"DET",
"DET",
NA,
" SPR, RTD "
),
PriceCode = c(NA, "KR", "SR", "DE", "KL", "ZT", "KR", "KR", "KR")
),
class = "data.frame",
row.names = c(NA,-9L)
)
# helper functions
# some additional helper functions to get the desired output format:
# adds `length` nrows of NA's to a data.frame
extend_df <- function(df, length) {
fill_df <- df[seq_len(length), ]
fill_df[,] <- NA
rbind(df, fill_df)
}
# cbind (column bind) two data.frames with different number of rows
# by appending NA with `extend_df()`
async_cbind <- function(df1, df2) {
df1_r <- nrow(df1)
df2_r <- nrow(df2)
if (df1_r < df2_r) {
df1 <- extend_df(df1, df2_r - df1_r)
return(cbind(df1, df2))
} else if (df1_r > df2_r) {
df2 <- extend_df(df2, df1_r - df2_r)
return(cbind(df1, df2))
} else {
cbind(df1, df2)
}
}
# split a data.frame by `cbind`ing each group
split_wide <- function(df, g) {
v <- as.character(substitute(g))
df_ls <- split(df[, -which(names(df) == v)], df[[v]])
out <- Reduce(async_cbind, df_ls)
names(out) <- make.names(names(out), unique = TRUE)
out
}
library(tidyverse)
To get the data.frame
of the first part:
res1 <- dat %>%
# convert `TYPE` to `factor`
mutate(TYPE = as.factor(TYPE)) %>%
# `nest_by()` year and month with `substr` which gives us a nested `tibble`
# with one `tibble` per month
nest_by(month = paste(substr(DATE, 1, 4), substr(DATE, 6,7), sep = "-")) %>%
# lets take each nested `tibble` and nest it again by appending the same data
# but with `TYPE` set as `"Total"`
mutate(data = list(
tibble(dat = list(mutate(data, TYPE = "Total"), data)) %>%
# we continue `rowsie` and summarise the inner nested `tibbles` by `TYPE`
rowwise() %>%
mutate(dat = list(dat %>%
group_by(TYPE, .drop = FALSE) %>%
summarise(n = n_distinct(CustID)) %>%
# let's add the frequency counts
mutate(freq = prop.table(n)))) %>%
# and unnest
unnest(dat)))
This gives us the following nested tibble ...
res1
#> # A tibble: 2 x 2
#> # Rowwise: month
#> month data
#> <chr> <list>
#> 1 2021-01 <tibble [3 × 3]>
#> 2 2021-02 <tibble [3 × 3]>
... where the data
column contains one data.frame
for each month:
res1 %>% pull(data)
#> [[1]]
#> # A tibble: 3 x 3
#> TYPE n freq
#> <chr> <int> <dbl>
#> 1 Total 4 1
#> 2 Demo 1 0.25
#> 3 Pro 3 0.75
#>
#> [[2]]
#> # A tibble: 3 x 3
#> TYPE n freq
#> <chr> <int> <dbl>
#> 1 Total 1 1
#> 2 Demo 0 0
#> 3 Pro 1 1
To get the data.frame
of the second part:
It is preferable to have the data in a different format, so that PriceCode
contains multiple comma separated codes similar to LogCat
.
res2a <- dat %>%
# filter out all rows with `TYPE == "Demo"`
filter(TYPE != "Demo") %>%
# add month variable containing year and month as character string.
mutate(month = paste(substr(DATE, 1, 4), substr(DATE,6,7), sep = "-")) %>%
# group by `CustID` and `month`
group_by(CustID, month) %>%
# collapse `PriceCode` so that format is similar to `LogCat`
mutate(PriceCode = paste(PriceCode, collapse= ",")) %>%
# ungroup and drop `TYPE` and `DATE`
ungroup() %>%
select(!c(TYPE, DATE))
Lets have a look at this first step:
res2a
#> # A tibble: 8 x 4
#> CustID LogCat PriceCode month
#> <chr> <chr> <chr> <chr>
#> 1 C-2 "SPR,DET,RTD" KR,SR,DE 2021-01
#> 2 C-2 "SPR,DET,RTD" KR,SR,DE 2021-01
#> 3 C-2 "SPR,DET,RTD" KR,SR,DE 2021-01
#> 4 C-3 "DET" KL,ZT,KR 2021-01
#> 5 C-3 "DET" KL,ZT,KR 2021-01
#> 6 C-3 "DET" KL,ZT,KR 2021-01
#> 7 C-4 <NA> KR 2021-01
#> 8 C-5 " SPR, RTD " KR 2021-02
In the next step we create all the columns we need for the final output table:
# First, we need to create named vectors to loop over both columns `LogCat` and `PriceCode`
# alternatively we could use dplyover::crossover() (a package I maintain on GitHub)
logcat_cols <-
dat$LogCat[!is.na(dat$LogCat)] %>%
strsplit(",") %>%
unlist %>%
trimws %>%
unique %>%
set_names(., paste0("LogCat__", ., "(All)"))
pricecode_cols <-
dat$PriceCode[!is.na(dat$PriceCode)] %>%
strsplit(",") %>%
unlist %>%
trimws %>%
unique %>%
set_names(., paste0("PriceCode__", ., "(All)"))
res2b <- res2a %>%
mutate(
# we need to keep month and CustID
month = month,
CustID = CustID,
# this gives us one dummy column for each LogCat
map_dfc(logcat_cols, ~ as.integer(grepl(.x, LogCat))),
# this gives us one dummy column for each PriceCode
map_dfc(pricecode_cols, ~ as.integer(grepl(.x, PriceCode))),
# Lets create another set of dummy columns checking if the category was the only one mentioned
# we do this of LogCat ...
across(starts_with("LogCat") & ends_with("(All)"),
list("d" = ~ if_else(rowSums(select(cur_data(), starts_with("LogCat") & ends_with("(All)"))) == .x & .x == 1, 1, 0))),
# ... and for PriceCode, and for now lets append a `_d` as suffix to the variable name
across(starts_with("PriceCode") & ends_with("(All)"),
list("d" = ~ if_else(rowSums(select(cur_data(), starts_with("PriceCode") & ends_with("(All)"))) == .x & .x == 1, 1, 0))),
# Lets add some final columns for when LogCat PriceCode are NA
across(c(LogCat, PriceCode),
~ if_else(is.na(.x), 1, 0),
.names = "{col}__Blank"),
# lets drop the original columns, we don't need them anymore
.keep = "none") %>%
# we still need to delete the suffix `_d` from some columns
rename_with(~ gsub("\\(All\\)_d$", "", .x), ends_with("_d"))
Lets have a glimpse at the data with all needed columns:
res2b %>% glimpse
#> Rows: 8
#> Columns: 20
#> $ month <chr> "2021-01", "2021-01", "2021-01", "2021-01", "2021…
#> $ CustID <chr> "C-2", "C-2", "C-2", "C-3", "C-3", "C-3", "C-4", …
#> $ `LogCat__SPR(All)` <int> 1, 1, 1, 0, 0, 0, 0, 1
#> $ `LogCat__DET(All)` <int> 1, 1, 1, 1, 1, 1, 0, 0
#> $ `LogCat__RTD(All)` <int> 1, 1, 1, 0, 0, 0, 0, 1
#> $ `PriceCode__KR(All)` <int> 1, 1, 1, 1, 1, 1, 1, 1
#> $ `PriceCode__SR(All)` <int> 1, 1, 1, 0, 0, 0, 0, 0
#> $ `PriceCode__DE(All)` <int> 1, 1, 1, 0, 0, 0, 0, 0
#> $ `PriceCode__KL(All)` <int> 0, 0, 0, 1, 1, 1, 0, 0
#> $ `PriceCode__ZT(All)` <int> 0, 0, 0, 1, 1, 1, 0, 0
#> $ LogCat__SPR <dbl> 0, 0, 0, 0, 0, 0, 0, 0
#> $ LogCat__DET <dbl> 0, 0, 0, 1, 1, 1, 0, 0
#> $ LogCat__RTD <dbl> 0, 0, 0, 0, 0, 0, 0, 0
#> $ PriceCode__KR <dbl> 0, 0, 0, 0, 0, 0, 1, 1
#> $ PriceCode__SR <dbl> 0, 0, 0, 0, 0, 0, 0, 0
#> $ PriceCode__DE <dbl> 0, 0, 0, 0, 0, 0, 0, 0
#> $ PriceCode__KL <dbl> 0, 0, 0, 0, 0, 0, 0, 0
#> $ PriceCode__ZT <dbl> 0, 0, 0, 0, 0, 0, 0, 0
#> $ LogCat__Blank <dbl> 0, 0, 0, 0, 0, 0, 1, 0
#> $ PriceCode__Blank <dbl> 0, 0, 0, 0, 0, 0, 0, 0
After we have created the data that we want to work with we have to do two things: (1) summarise the data by month and (2) bring it into the correct output format
res2c <- res2b %>%
# lets nest by month
nest_by(month) %>%
mutate(
# lets add a count of how many CustIDs each month contains
total_count = n_distinct(data$CustID),
data = list(
data %>%
# we only keep distinct rows
distinct(.keep_all = TRUE) %>%
# bring the data in long format ...
pivot_longer(!CustID,
names_to = c("var", "key"),
names_sep = "__") %>%
# ... group by var and key created in the pivot_longer call
group_by(var, key) %>%
# and sum the values to get the final `count` variable
summarise(count = sum(value, na.rm = TRUE),
.groups = "drop") %>%
# NOTE SURE: is the frequency the count of each category divided by the total count?
mutate(freq = .data$count / .env$total_count) %>%
# lets sort "Blank" at the bottom
arrange(key == "Blank") %>%
# final we use our custom function to split the data.frame in two parts
split_wide(var) %>%
# and last but not least, we need one more row:
add_row(key = "Both Blank",
count = sum(subset(., key == "Blank")$count +
subset(., key.1 == "Blank")$count,
na.rm = TRUE))
))
Our result is a nested tibble:
res2c
#> # A tibble: 2 x 3
#> # Rowwise: month
#> month data total_count
#> <chr> <list> <int>
#> 1 2021-01 <df[,6] [12 × 6]> 3
#> 2 2021-02 <df[,6] [12 × 6]> 1
Each month has its own data.frame
in the desired output format:
res2c %>% pull(data)
#> [[1]]
#> key count freq key.1 count.1 freq.1
#> 1 DET 1 0.3333333 DE 0 0.0000000
#> 2 DET(All) 2 0.6666667 DE(All) 1 0.3333333
#> 3 RTD 0 0.0000000 KL 0 0.0000000
#> 4 RTD(All) 1 0.3333333 KL(All) 1 0.3333333
#> 5 SPR 0 0.0000000 KR 1 0.3333333
#> 6 SPR(All) 1 0.3333333 KR(All) 3 1.0000000
#> 7 Blank 1 0.3333333 SR 0 0.0000000
#> 8 <NA> NA NA SR(All) 1 0.3333333
#> 9 <NA> NA NA ZT 0 0.0000000
#> 10 <NA> NA NA ZT(All) 1 0.3333333
#> 11 <NA> NA NA Blank 0 0.0000000
#> 12 Both Blank 0 NA <NA> NA NA
#>
#> [[2]]
#> key count freq key.1 count.1 freq.1
#> 1 DET 0 0 DE 0 0
#> 2 DET(All) 0 0 DE(All) 0 0
#> 3 RTD 0 0 KL 0 0
#> 4 RTD(All) 1 1 KL(All) 0 0
#> 5 SPR 0 0 KR 1 1
#> 6 SPR(All) 1 1 KR(All) 1 1
#> 7 Blank 0 0 SR 0 0
#> 8 <NA> NA NA SR(All) 0 0
#> 9 <NA> NA NA ZT 0 0
#> 10 <NA> NA NA ZT(All) 0 0
#> 11 <NA> NA NA Blank 0 0
#> 12 Both Blank 0 NA <NA> NA NA
Created on 2021-04-21 by the reprex package (v0.3.0)
Upvotes: 1
Reputation: 26218
For the first part library janitor
will be helpful (even for second part)
library(tidyverse)
library(janitor)
df %>% mutate(DATE = as.Date(DATE)) %>% select(1:3) %>%
unique() %>%
tabyl(TYPE, DATE) %>%
adorn_totals("row") %>%
adorn_percentages("col") %>%
adorn_pct_formatting(2) %>%
adorn_ns("front")
TYPE 2021-01-02 2021-01-03
Demo 1 (25.00%) 0 (0.00%)
Pro 3 (75.00%) 1 (100.00%)
Total 4 (100.00%) 1 (100.00%)
Upvotes: 1