R language: how to return and print a list of missing entries based on two columns

Question

I'm struggling to write R code that prints a "list of dates that do not have data between given start and end dates for all the possible values of another variable / column in a table". It's a little difficult to explain in words, so I'll give a very simplified example that will hopefully make it clear what I'm trying to do.

You are the manager of a pet store and in charge of checking the quality of pet food sales data. The data comes in a csv file with four columns: date, type of animal food (animal_type), sales price, and quantity sold. The animal_type column can have three possible values: dog, cat, or bird, stored as strings.

I've simulated the first three days' worth of data for December in a very simplified form below. The price and quantity columns aren't relevant, so I've left them blank.

date	animal_type	price	quantity
2021-12-01	dog
2021-12-01	dog
2021-12-01	cat
2021-12-01	bird
2021-12-02	dog
2021-12-02	bird
2021-12-03	cat
2021-12-03	cat
2021-12-03	cat

What I'm trying to do is print out / return the dates that don't have entries for all the possible values in the animal_type column. So for my example, what I'm looking to print out is something like...

2021-12-02  :  ['cat']
2021-12-03  :  ['dog', 'bird']

Because [2021-12-02] doesn't have an entry for 'cat' and [2021-12-03] doesn't have entries for 'dog' or 'bird' in the data. However, I've only been able to get a count of the number of unique animal_type values for each date so far with the following functions.

library(tidyverse)
library(dplyr) # already attached by tidyverse

df %>% group_by(date) %>% summarise(n = n_distinct(animal_type)) # counts the unique animal_type values appearing in the entries for every date
df %>% group_by(animal_type) %>% summarise(num_dates = n_distinct(date)) # counts the unique dates appearing in the entries for every animal_type

# output of the first command (number of unique animal_type values per date)
  date            n
1 2021-12-01      3
2 2021-12-02      2
3 2021-12-03      1

# output of the second command (number of unique dates per animal_type)
  animal_type   num_dates
1 dog             2
2 cat             2
3 bird            2

This tells me which dates have missing animal_type values, but not which one(s) specifically. I've tried looking around but couldn't find many similar problems, so I'm wondering how feasible this would be. I'm also rusty with R and relearning much of the syntax, packages, and libraries, so I could be missing something simple. I'm open to both tidyverse / dplyr and base R advice, as you can likely see from my code. I would appreciate any help, and thank you for your time!

Daniel Ordoñez · Accepted Answer

You can use both the tidyr::complete function and an anti-join.

First you have to complete the implicit missing values and then anti-join the completed tibble with the one you currently have.

See the example below:

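library(tidyverse)

# Every Date x Pet combination (9 rows, sorted by Date, then Pet)
example <- crossing(Date = c("2021-12-01", "2021-12-02", "2021-12-03"),
                    Pet = c("Bird", "Cat", "Dog"))

# Mimic the question's data: drop 2021-12-02 Cat, 2021-12-03 Bird and 2021-12-03 Dog
op_example <- example %>% slice(-c(5, 7, 9))

# Fill in the missing combinations, then keep only the rows absent from the original
op_example %>%
  complete(Date, Pet) %>%
  anti_join(op_example, by = c("Date", "Pet"))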
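
To get from that result to the one-line-per-date output asked for in the question, something along these lines may work. This is only a sketch under a few assumptions: the data frame is called df with columns date and animal_type as in the question, and the expected animal types and the start and end dates are known up front. The names expected_types, start_date, end_date and missing_by_date are made up for illustration. It builds the full grid with crossing() instead of complete(), so that a date with no rows at all would also be reported.

```r
library(dplyr)
library(tidyr)

# The question's sample data (price and quantity left out, as they are not relevant)
df <- tibble(
  date = as.Date(c("2021-12-01", "2021-12-01", "2021-12-01", "2021-12-01",
                   "2021-12-02", "2021-12-02",
                   "2021-12-03", "2021-12-03", "2021-12-03")),
  animal_type = c("dog", "dog", "cat", "bird",
                  "dog", "bird",
                  "cat", "cat", "cat")
)

# Assumed inputs (hypothetical names): the date range and the values to check for
start_date     <- as.Date("2021-12-01")
end_date       <- as.Date("2021-12-03")
expected_types <- c("dog", "cat", "bird")

missing_by_date <- crossing(
  date = seq(start_date, end_date, by = "day"),  # every day in the range
  animal_type = expected_types                   # every expected animal type
) %>%
  anti_join(df, by = c("date", "animal_type")) %>%  # keep combinations absent from df
  group_by(date) %>%
  summarise(missing = paste(animal_type, collapse = ", "), .groups = "drop")

# Print one line per date that has at least one missing animal type
cat(sprintf("%s  :  [%s]", format(missing_by_date$date), missing_by_date$missing),
    sep = "\n")
# 2021-12-02  :  [cat]
# 2021-12-03  :  [bird, dog]
```

Dates where every animal type is present (2021-12-01 here) drop out after the anti-join, so they do not appear in the output at all. The animal types within a line come out alphabetically because crossing() sorts its inputs.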
