user9211845
user9211845

Reputation: 131

Get count of horizontal variables based on condition in R

I have a data frame in R that looks like:

Data<-

RefId       Status     Year       PS Score      Price    Code   StatusID
IU-1        Approved   2020       -             -        23     FT
IU-1        Cancel     2020       -             -        21     CT
IU-2        Approved   2020       -             NA       13     FT
IU-3        Approved   2019       NA            -        40     FT

I need to count the unique RefID whose Status is approved and count of each -, NA and Not Null values of variables starting from column 4 group by year.

Where 1st row corresponding to the year value have the unique count of RefID below that row it would be sum of count or respective variables.

Required output format-

Year         Unique_Count    PS Score       Price        Code       StatusID
2020         2               NA             NA           NA         NA                
NA           0               0              1            0          0
Having "-"   2               2              1            0          0
Not Null     2               0              0            2          2
2019         1               NA             NA           NA         NA
NA           1               1              0            0          0
Having "-"   1               0              1            0          0
Not Null     2               0              0            1          1

Upvotes: 2

Views: 139

Answers (1)

akrun
akrun

Reputation: 887118

Based on the data showed, create a logical column 'i1' where 'Status' is 'Approved' (or may filter as well), then grouped by 'Year', summarise (in the new version of dplyr - summarise can return more than 1 row per group), create a 'categ' column that specifies the various categories of frequencies, then the 'Unique_Count' as a list of values that returns the number of distinct (n_distinct) elements of 'RefId' based on 'i1', count of NA elements (sum(is.na(), count of elements with - (str_detect(...), count of Not NULL (no NA - sum(!is.na(). Similarly, we loop across the columns from PSScore to StatusID, does similar counts in a list, then unnest the list elements and remove the 'Year' column

library(dplyr)
library(stringr)
library(tidyr)
df1 %>%
    mutate(i1 = Status == 'Approved') %>%
    group_by(Year) %>% 
    summarise(categ = c(first(Year), "NA", "Having -", "Not NULL"),
         Unique_Count = list(n_distinct(RefId[i1]), 
                sum(is.na(RefId[i1])), sum(str_detect(RefId[i1], '-')),
                sum(!is.na(RefId[i1]))), 
       across(PSScore:StatusID, ~ list(NA_real_, sum(is.na(.[i1])), 
         sum(.[i1] %in% '-'), sum(!(is.na(.[i1])| .[i1] %in% '-')))), 
         .groups = 'drop') %>%
    unnest(names(.)[-1]) %>% 
    select(-Year)
# A tibble: 8 x 6
#  categ    Unique_Count PSScore Price  Code StatusID
#  <chr>           <int>   <dbl> <dbl> <dbl>    <dbl>
#1 2019                1      NA    NA    NA       NA
#2 NA                  0       1     0     0        0
#3 Having -            1       0     1     0        0
#4 Not NULL            1       0     0     1        1
#5 2020                2      NA    NA    NA       NA
#6 NA                  0       0     1     0        0
#7 Having -            2       2     1     0        0
#8 Not NULL            2       0     0     2        2

data

df1 <- structure(list(RefId = c("IU-1", "IU-1", "IU-2", "IU-3"), 
Status = c("Approved", 
"Cancel", "Approved", "Approved"), Year = c(2020L, 2020L, 2020L, 
2019L), PSScore = c("-", "-", "-", NA), Price = c("-", "-", NA, 
"-"), Code = c(23L, 21L, 13L, 40L), StatusID = c("FT", "CT", 
"FT", "FT")), class = "data.frame", row.names = c(NA, -4L))

Upvotes: 1

Related Questions