Reputation: 131
I have a data frame in R that looks like:
Data<-
RefId Status Year PS Score Price Code StatusID
IU-1 Approved 2020 - - 23 FT
IU-1 Cancel 2020 - - 21 CT
IU-2 Approved 2020 - NA 13 FT
IU-3 Approved 2019 NA - 40 FT
I need to count the unique RefID
whose Status
is approved and count of each -
, NA
and Not Null values of variables starting from column 4 group by year.
Where 1st row corresponding to the year value have the unique count of RefID
below that row it would be sum of count or respective variables.
Required output format-
Year Unique_Count PS Score Price Code StatusID
2020 2 NA NA NA NA
NA 0 0 1 0 0
Having "-" 2 2 1 0 0
Not Null 2 0 0 2 2
2019 1 NA NA NA NA
NA 1 1 0 0 0
Having "-" 1 0 1 0 0
Not Null 2 0 0 1 1
Upvotes: 2
Views: 139
Reputation: 887118
Based on the data showed, create a logical column 'i1' where 'Status' is 'Approved' (or may filter
as well), then grouped by 'Year', summarise
(in the new version of dplyr
- summarise can return more than 1 row per group), create a 'categ' column that specifies the various categories of frequencies, then the 'Unique_Count' as a list
of values that returns the number of distinct (n_distinct
) elements of 'RefId' based on 'i1', count of NA elements (sum(is.na(
), count of elements with -
(str_detect(...
), count of Not NULL (no NA - sum(!is.na(
). Similarly, we loop across
the columns from PSScore to StatusID, does similar counts in a list
, then unnest
the list
elements and remove the 'Year' column
library(dplyr)
library(stringr)
library(tidyr)
df1 %>%
mutate(i1 = Status == 'Approved') %>%
group_by(Year) %>%
summarise(categ = c(first(Year), "NA", "Having -", "Not NULL"),
Unique_Count = list(n_distinct(RefId[i1]),
sum(is.na(RefId[i1])), sum(str_detect(RefId[i1], '-')),
sum(!is.na(RefId[i1]))),
across(PSScore:StatusID, ~ list(NA_real_, sum(is.na(.[i1])),
sum(.[i1] %in% '-'), sum(!(is.na(.[i1])| .[i1] %in% '-')))),
.groups = 'drop') %>%
unnest(names(.)[-1]) %>%
select(-Year)
# A tibble: 8 x 6
# categ Unique_Count PSScore Price Code StatusID
# <chr> <int> <dbl> <dbl> <dbl> <dbl>
#1 2019 1 NA NA NA NA
#2 NA 0 1 0 0 0
#3 Having - 1 0 1 0 0
#4 Not NULL 1 0 0 1 1
#5 2020 2 NA NA NA NA
#6 NA 0 0 1 0 0
#7 Having - 2 2 1 0 0
#8 Not NULL 2 0 0 2 2
df1 <- structure(list(RefId = c("IU-1", "IU-1", "IU-2", "IU-3"),
Status = c("Approved",
"Cancel", "Approved", "Approved"), Year = c(2020L, 2020L, 2020L,
2019L), PSScore = c("-", "-", "-", NA), Price = c("-", "-", NA,
"-"), Code = c(23L, 21L, 13L, 40L), StatusID = c("FT", "CT",
"FT", "FT")), class = "data.frame", row.names = c(NA, -4L))
Upvotes: 1