Joey
Joey

Reputation: 67

Merge datasets based on closest lowest date

Machines = c('A', 'A', 'B', 'C', 'D', 'A', 'B')
Week = c('201651',  '201635', '201714', '201516', '201438', '201704', '201651')
Manual.Code = c('M123', 'M467', 'M123', 'M846', 'M898', 'QQQQ' ,'M898')
DF1 = data.frame(Machines, Week, Manual)

Manual.Code = c('M123', 'M123', 'M123', 'CTTY', 'M467', 'M.99', 'M846', 'M898') 
Version = as.numeric(c('2.0', '1.1', '1.0', '2.0', '1.6', '1.67', '3.2', '2.4'))
Release_week = (c('201711', '201638', '201612', '201424', '201552', '201345', 
'201719', '201647'))
DF2 = data_frame(Manual.Code, Version, Release_week)

I have been using R for some months now, but I still consider myself a beginner. My next problem involves the usage of a manual for maintenance on machine types at a certain point in time. DF1 are my machines and DF2 are the versions of the manual used.

DF1:

Type  Week    Manual.Code
A     201651  M123
A     201635  M467
B     201714  M123
C     201516  M846
D     201438  M898
A     201704  QQQQ
B     201651  M898

DF2: 

Manual.Code Version Release_week
M123        2.0     201711
M123        1.1     201638
M123        1.0     201612
CTTY        2.0     201424
M467        1.6     201552
M.99        1.67    201345
M846        3.2     201719 
M898        2.4     201647

Desired output DF3: 
Type  Week    Manual.Code  Used.Version
A     201651  M123         1.1 
A     201635  M467         1.6
B     201714  M123         2.0
C     201516  M846         NA
D     201438  M898         NA
A     201704  QQQQ         NA
B     201651  M898         2.4

I have had a look at other threads and from that I think that the solution is in either rolling joins, data.tables using unequal-joins, or a fuzzyjoin from the fuzzyjoin package, but I haven't managed to get either one of these solutions to work yet. What would be the best way to merge these two datasets?

p.s. sizes real datasets:

DF1: 35000 x 43
DF2: 217000 x 14

Upvotes: 1

Views: 95

Answers (3)

Uwe
Uwe

Reputation: 42544

The data.table package allows for rolling joins:

library(data.table)
setDT(DF2)[setDT(DF1), on = .(Manual.Code, Release_week = Week), roll = +Inf]
   Manual.Code Version Release_week Machines
1:        M123     1.1       201651        A
2:        M467     1.6       201635        A
3:        M123     2.0       201714        B
4:        M846      NA       201516        C
5:        M898      NA       201438        D
6:        QQQQ      NA       201704        A
7:        M898     2.4       201651        B

Please note that only integer, double or character colums may be roll joined. So, make sure to use the stringsAsFactors = FALSE parameter when calling data.frame().

The result can be fine-tuned to comply with OP's expected result:

library(data.table)
result <- setDT(DF2)[setDT(DF1), on = .(Manual.Code, Release_week = Week), roll = +Inf]
setnames(result, c("Release_week", "Version"), c("Week", "Used.Version"))
setcolorder(result, names(DF1))
result
   Machines   Week Manual.Code Used.Version
1:        A 201651        M123          1.1
2:        A 201635        M467          1.6
3:        B 201714        M123          2.0
4:        C 201516        M846           NA
5:        D 201438        M898           NA
6:        A 201704        QQQQ           NA
7:        B 201651        M898          2.4

Data

Note that the data as provided by the OP are used but with stringsAsFactors = FALSE.

Machines = c('A', 'A', 'B', 'C', 'D', 'A', 'B')
Week = c('201651',  '201635', '201714', '201516', '201438', '201704', '201651')
Manual.Code = c('M123', 'M467', 'M123', 'M846', 'M898', 'QQQQ' ,'M898')
DF1 = data.frame(Machines, Week, Manual.Code, stringsAsFactors = FALSE)

Manual.Code = c('M123', 'M123', 'M123', 'CTTY', 'M467', 'M.99', 'M846', 'M898') 
Version = as.numeric(c('2.0', '1.1', '1.0', '2.0', '1.6', '1.67', '3.2', '2.4'))
Release_week = (c('201711', '201638', '201612', '201424', '201552', '201345', 
'201719', '201647'))
DF2 = data.frame(Manual.Code, Version, Release_week, stringsAsFactors = FALSE)

Upvotes: 1

zx8754
zx8754

Reputation: 56159

Using dplyr, left_join, then arrange them by absolute (abs) difference between weeks, and get the first one (slice), finally assign Version to NA where Release_week is later than Week:

library(dplyr)

DF1 %>% 
  mutate(rn = row_number()) %>% 
  left_join(DF2, by = "Manual.Code") %>% 
  #if you have different column names use this:
  #left_join(DF2, by = c("Manual" = "Manual.Code")) %>% 
  group_by(rn) %>% 
  arrange(rn, abs(Release_week - Week)) %>% 
  slice(1) %>% 
  ungroup() %>% 
  select(-rn) %>% 
  mutate(Version = ifelse(Release_week > Week, NA, Version))

# # A tibble: 7 x 5
#   Type    Week Manual.Code Version Release_week
#   <chr>  <int> <chr>         <dbl>        <int>
# 1 A     201651 M123            1.1       201638
# 2 A     201635 M467            1.6       201552
# 3 B     201714 M123            2         201711
# 4 C     201516 M846           NA         201719
# 5 D     201438 M898           NA         201647
# 6 A     201704 QQQQ           NA             NA
# 7 B     201651 M898            2.4       201647

Example input data:

DF1 <- read.table(text = "
Type  Week    Manual.Code
A     201651  M123
A     201635  M467
B     201714  M123
C     201516  M846
D     201438  M898
A     201704  QQQQ
B     201651  M898
", header = TRUE, stringsAsFactors = FALSE)

DF2 <- read.table(text = "
Manual.Code Version Release_week
M123        2.0     201711
M123        1.1     201638
M123        1.0     201612
CTTY        2.0     201424
M467        1.6     201552
M.99        1.67    201345
M846        3.2     201719 
M898        2.4     201647
", header = TRUE, stringsAsFactors = FALSE)

Upvotes: 3

Benjamin Schlegel
Benjamin Schlegel

Reputation: 527

I think I found a solution to your problem. However I get a different output then you when just taking the preceding weeks. What I do is to look up the closest preceding week with a function and then make a LEFT JOIN with that column together with the manual.

options(stringsAsFactors = FALSE)
df1 = data.frame(type = c("A","A","B","C","D","A","B"),
             week = c(201651, 201635, 201714, 201516, 201438, 201704, 201651),
             manual.code = c("M123", "M467", "M123", "M846", "M898", "QQQQ", "M898"))

df2 = data.frame(manual.code = c("M123","M123","M123","CTTY","M467","M.99","M846", "M898"),
             version = c("2.0","1.1","1.0", "2.0", "1.6", "1.67", "3.2", "2.4"),
             release_week = c(201711,201638,201612,201424,201652,201345,201719,201647))


df3 = df1
df3$closest_week = NA

get_clostest_week = function(manual, week){
  manual_weeks = df2[which(df2$manual.code == manual & df2$release_week <= week),"release_week"]
  if(length(manual_weeks)==0){
    return(NA)
  }
  diff = abs(manual_weeks - week)
  manual_weeks[which(diff == min(diff))]
}

for(i in seq_len(nrow(df3))){
  df3$closest_week[i] = get_clostest_week(df3$manual.code[i], df3$week[i])
}



df3 = merge(df3, df2, by.x = c("manual.code","closest_week"), 
        by.y = c("manual.code","release_week"), all.x = TRUE)))

(The answer has been edited because of misunderstanding of the problem.)

Upvotes: 0

Related Questions