Karl Johnson
Karl Johnson

Reputation: 111

Add in "empty" rows when joining tables

I am working on an application that monitors worker productivity in R. For each worker I have the jobs they completed and I want to see how many jobs they completed each week of the year ( so weeks 1 - 52). The issue is that some workers don't have any work logged for certain weeks (e.g. Worker 1 might have worked weeks 1 - 48, but not 49 - 52).

The source data is tables for each job top where the observation is the worker, formatted like so:

UniqueID  Date     mapper
10000001 3/3/2015  person1

Starting with this I then break the date up into year, month, day fields and then use lubridate to get the weeks. This is applied to each source data file (of which there are 10).

I then group each table by the user and week and count the number of jobs each worker completed, see code here:


library(tidyverse)
library(readxl)
library(rlang)
library(pryr)
library(lubridate)
library(extrafont)
loadfonts(device = "win")


current_week <- week(Sys.Date())
feature_sheets <- excel_sheets("./typesMerged.xlsx")
list_sheets <- lapply(feature_sheets, function(x) read_excel(path = "./typesMerged.xlsx",col_types = c("numeric", "date", "text"), sheet = x))

users = list('person1', 'person2', 'person3', 'person4', 'person5', 'person6')


createTibble <- function(i, yr, wk){
  is_missing <- missing(wk)
  list_sheets[[i]]%>%
    filter(LASTUSER %in% users ) %>% 
    mutate(year = as.numeric(format(DATEMODIFI, format = "%Y")),
           month = as.numeric(format(DATEMODIFI, format = "%m")),
           week = week(DATEMODIFI),
           day = as.numeric(format(DATEMODIFI, format = "%d")))%>% 
    select(-DATEMODIFI) %>%
    filter(year == yr) %>%
    filter(if(is_missing) TRUE else week == wk) 
}




tableNames <- c('dpd', 'fuse', 'ohprimary', 'ohsecondary', 'poles', 'pv', 'switch','transformers', 'ugprimary', 'ugsecondary' )
features <- vector('list', length(tableNames))


for (feature in seq_along(features)){

  features[[feature]] <- createTibble(i = feature, yr = 2019)


}

# use for iteratively creating tibbles/df if needed
# for (feature in seq_along(features)){
#   assign(features[feature], value = createTibble(feature))
# }


countFeatures <- function(x){
  x %>%
    group_by(LASTUSER, week) %>%
    summarize(n = n())
}


featureCountTibbles <-map(.x = features,.f = countFeatures)

featureCountsByUser <- featureCountTibbles %>% reduce(full_join, by = c("LASTUSER", "week")) %>% 
  rename(mapper = LASTUSER) %>% 
  rename(dpd = n.x) %>%
  rename(fuse = n.y) %>% 
  rename(ohprimary = n.x.x) %>% 
  rename(ohsecondary = n.y.y) %>% 
  rename(poles = n.x.x.x) %>% 
  rename(pv = n.y.y.y) %>% 
  rename(switch = n.x.x.x.x) %>% 
  rename(transformers = n.y.y.y.y) %>% 
  rename(ugprimary = n.x.x.x.x.x) %>% 
  rename(ugsecondary = n.y.y.y.y.y) %>% 
  replace_na(replace = list(dpd = 0, fuse = 0, ohprimary = 0, ohsecondary = 0, poles = 0, pv = 0, switch = 0, transformers = 0, ugprimary = 0, ugsecondary = 0))

this results in a table that looks like this, the problem here is that person1 did no work in week 5 so he doesn't have a row for it:

mapper  week  dpd  fuse  etc.
person1  1    10    50   ...
person1  2     0    50   ...
person1  3    10     0   ...
person1  4    10    50   ...
person1  6    10    50   ...
person2  1    10    50   ...
person2  2    50    50   ...
person2  3    10     0   ...
person2  4    10    50   ...
person2  5    10    50   ***
person2  6    10    50   ...

This is how I want it to look where weeks of no work are seen:

mapper  week  dpd  fuse  etc.
person1  1    10    50   ...
person1  2     0    50   ...
person1  3    10     0   ...
person1  4    10    50   ...
person1  5     0     0   ***
person1  6    10    50   ...
person2  1    10    50   ...
person2  2    50    50   ...
person2  3    10     0   ...
person2  4    10    50   ...
person2  5    10    50   ...
person2  6    10    50   ...

How can I make it so these empty rows will be added? It sounds simple, but I can't really think of a way to do it in R (still kind of new to it). If this was SQL I would add a new field week at the beginning, populate it with 1 - 52 (which would add in all the rows I need) and then update the table with the actual data I have for the weeks.

Solution Implemented:

featureCountsByUser <- featureCountTibbles %>% reduce(full_join, by = c("LASTUSER", "week")) %>%
  rename(mapper = LASTUSER) %>%
  rename(dpd = n.x) %>%
  rename(fuse = n.y) %>% 
  rename(ohprimary = n.x.x) %>% 
  rename(ohsecondary = n.y.y) %>% 
  rename(poles = n.x.x.x) %>% 
  rename(pv = n.y.y.y) %>% 
  rename(switch = n.x.x.x.x) %>% 
  rename(transformers = n.y.y.y.y) %>% 
  rename(ugprimary = n.x.x.x.x.x) %>% 
  rename(ugsecondary = n.y.y.y.y.y) %>% 
  ungroup() %>% 
  complete(mapper,week) %>% 
  group_by(mapper,week) %>% 
  replace_na(replace = list(dpd = 0, fuse = 0, ohprimary = 0, ohsecondary = 0, poles = 0, pv = 0, switch = 0, transformers = 0, ugprimary = 0, ugsecondary = 0)) 

complete() was the answer; but contrary to the docs it did not respect my grouping and due to the nature of how I am counting jobs done I could not complete() on the raw data so this is my work around. Thanks all!

Upvotes: 1

Views: 200

Answers (1)

mrhellmann
mrhellmann

Reputation: 5499

Using the results data you posted that isn't what you want:

library(tidyverse)
x <- c("Worker  Week  dpd  fuse ", 
  "person1  1    10    5   ",
  "person1  2     0    5   ",
  "person1  3    10        ",
  "person1  4    10    5   ",
  "person1  6    10    5   ",
  "person2  1    10    5   ",
  "person2  2    50    5   ",
  "person2  3    10        ",
  "person2  4    10    5   ",
  "person2  5    10    5   ",
  "person2  6    10    5   ") %>%
  read_table()


x %>% complete(Worker, Week)

Should give:

# A tibble: 12 x 4
   Worker   Week   dpd  fuse
   <chr>   <dbl> <dbl> <dbl>
 1 person1     1    10     5
 2 person1     2     0     5
 3 person1     3    10    NA
 4 person1     4    10     5
 5 person1     5    NA    NA
 6 person1     6    10     5
 7 person2     1    10     5
 8 person2     2    50     5
 9 person2     3    10    NA
10 person2     4    10     5
11 person2     5    10     5
12 person2     6    10     5

complete() has a options for filling in missing data, link to reference above by @aosmith. Filling NA with 0 shouldn't be a problem.

Upvotes: 1

Related Questions