syebill
syebill

Reputation: 543

Missing Value Extraction

I have two datasets and their dput is below:

Scenario: df1 contains data on unitprices for certain dates and times. df2 contains the competitor for each station code in df1 along with prices and competitor brand name. df2 does not have all the unit prices for dates and times in df1. I need to somehow merge df2 in df1 such that I get date-time, Product, ServiceType & Brand but NA in unitprices for competitors so I can impute these values after.

What I am trying to do is as follows:

  1. Find out which date-time unit prices are missing in df2 based on values in df1 conditioned on similar Station Code, Product & Service Type. Each station code in df1 has multiple "stationscode" in df2 as there can be be more than one competitor.

  2. Somehow extract and merge this information so I can impute values after.

Let's suppose df1 is

 date StationCode  ServiceType Product unitprice
1 2017-06-01 06:00:00        1002 Self-Service      GG     1.345
2 2017-06-01 07:00:00        1002       Served      GG     1.456
3 2017-06-01 07:00:00        1002 Self-Service      SP     1.012 
dput(df1) 


structure(list(date = structure(c(1496300400,
     1496304000, 1496304000 ), class = c("POSIXct", "POSIXt"), tzone =
     "Etc/GMT+1"), StationCode = c(1002,  1002, 1002), ServiceType =
     structure(c(1L, 2L, 1L), .Label = c("Self-Service",  "Served"), class
     = "factor"), Product = structure(c(1L, 1L, 2L ), .Label = c("GG", "SP"), class = "factor"), unitprice = c(1.345, 
     1.456, 1.012)), class = "data.frame", row.names = c(NA, -3L))

and df2 is

date compstcode StationCode  ServiceType Product unitprice brand
1 2017-06-01 06:00:00       3456        1002 Self-Service      GG     1.425 Shell
2 2017-06-01 06:00:00       1267        1002       Served      SP     1.406    BP
3 2017-06-01 06:00:00       5488        1002 Self-Service      GG     1.011 Total

dput(df2)
    structure(list(date = structure(c(1496300400, 1496300400, 1496300400
), class = c("POSIXct", "POSIXt"), tzone = "Etc/GMT+1"), compstcode = c(3456, 
1267, 5488), StationCode = c(1002, 1002, 1002), ServiceType = structure(c(1L, 
2L, 1L), .Label = c("Self-Service", "Served"), class = "factor"), 
    Product = structure(c(1L, 2L, 1L), .Label = c("GG", "SP"), class = "factor"), 
    unitprice = c(1.425, 1.406, 1.011), brand = structure(c(2L, 
    1L, 3L), .Label = c("BP", "Shell", "Total"), class = "factor")), class = "data.frame", row.names = c(NA, 
-3L))

I want to extract compstcode, uprice and brand from df2 into df1 such that we have the compstcode and brand not null whereas uprice can be null for all dates and times of df1.

Upvotes: 0

Views: 42

Answers (1)

Mako212
Mako212

Reputation: 7292

I think you're essentially looking to join these two data frames based on multiple criteria, but let me know if that doesn't capture what you're trying to do.

library(tidyverse)

df1 %>% left_join(., df2 %>% select(-date), by = c("StationCode", "ServiceType", "Product"))

                 date StationCode  ServiceType Product unitprice.x compstcode unitprice.y brand
1 2017-06-01 06:00:00        1002 Self-Service      GG       1.345       3456       1.425 Shell
2 2017-06-01 06:00:00        1002 Self-Service      GG       1.345       5488       1.011 Total
3 2017-06-01 07:00:00        1002       Served      GG       1.456         NA          NA  <NA>
4 2017-06-01 07:00:00        1002 Self-Service      SP       1.012         NA          NA  <NA>

But you don't have to exclude date, it could similarly be (renaming columns for clarity):

df1 %>% left_join(., df2 %>% rename(compDate = date, compunitprice = unitprice), by = c("StationCode", "ServiceType", "Product"))

Upvotes: 1

Related Questions