Reputation: 67
Machines = c('A', 'A', 'B', 'C', 'D', 'A', 'B')
Week = c('201651', '201635', '201714', '201516', '201438', '201704', '201651')
Manual.Code = c('M123', 'M467', 'M123', 'M846', 'M898', 'QQQQ' ,'M898')
DF1 = data.frame(Machines, Week, Manual)
Manual.Code = c('M123', 'M123', 'M123', 'CTTY', 'M467', 'M.99', 'M846', 'M898')
Version = as.numeric(c('2.0', '1.1', '1.0', '2.0', '1.6', '1.67', '3.2', '2.4'))
Release_week = (c('201711', '201638', '201612', '201424', '201552', '201345',
'201719', '201647'))
DF2 = data_frame(Manual.Code, Version, Release_week)
I have been using R for some months now, but I still consider myself a beginner. My next problem involves the usage of a manual for maintenance on machine types at a certain point in time. DF1 are my machines and DF2 are the versions of the manual used.
DF1:
Type Week Manual.Code
A 201651 M123
A 201635 M467
B 201714 M123
C 201516 M846
D 201438 M898
A 201704 QQQQ
B 201651 M898
DF2:
Manual.Code Version Release_week
M123 2.0 201711
M123 1.1 201638
M123 1.0 201612
CTTY 2.0 201424
M467 1.6 201552
M.99 1.67 201345
M846 3.2 201719
M898 2.4 201647
Desired output DF3:
Type Week Manual.Code Used.Version
A 201651 M123 1.1
A 201635 M467 1.6
B 201714 M123 2.0
C 201516 M846 NA
D 201438 M898 NA
A 201704 QQQQ NA
B 201651 M898 2.4
I have had a look at other threads and from that I think that the solution is in either rolling joins, data.tables using unequal-joins, or a fuzzyjoin from the fuzzyjoin package, but I haven't managed to get either one of these solutions to work yet. What would be the best way to merge these two datasets?
p.s. sizes real datasets:
DF1: 35000 x 43
DF2: 217000 x 14
Upvotes: 1
Views: 95
Reputation: 42544
The data.table
package allows for rolling joins:
library(data.table)
setDT(DF2)[setDT(DF1), on = .(Manual.Code, Release_week = Week), roll = +Inf]
Manual.Code Version Release_week Machines 1: M123 1.1 201651 A 2: M467 1.6 201635 A 3: M123 2.0 201714 B 4: M846 NA 201516 C 5: M898 NA 201438 D 6: QQQQ NA 201704 A 7: M898 2.4 201651 B
Please note that only integer, double or character colums may be roll joined. So, make sure to use the stringsAsFactors = FALSE
parameter when calling data.frame()
.
The result can be fine-tuned to comply with OP's expected result:
library(data.table)
result <- setDT(DF2)[setDT(DF1), on = .(Manual.Code, Release_week = Week), roll = +Inf]
setnames(result, c("Release_week", "Version"), c("Week", "Used.Version"))
setcolorder(result, names(DF1))
result
Machines Week Manual.Code Used.Version 1: A 201651 M123 1.1 2: A 201635 M467 1.6 3: B 201714 M123 2.0 4: C 201516 M846 NA 5: D 201438 M898 NA 6: A 201704 QQQQ NA 7: B 201651 M898 2.4
Note that the data as provided by the OP are used but with stringsAsFactors = FALSE
.
Machines = c('A', 'A', 'B', 'C', 'D', 'A', 'B')
Week = c('201651', '201635', '201714', '201516', '201438', '201704', '201651')
Manual.Code = c('M123', 'M467', 'M123', 'M846', 'M898', 'QQQQ' ,'M898')
DF1 = data.frame(Machines, Week, Manual.Code, stringsAsFactors = FALSE)
Manual.Code = c('M123', 'M123', 'M123', 'CTTY', 'M467', 'M.99', 'M846', 'M898')
Version = as.numeric(c('2.0', '1.1', '1.0', '2.0', '1.6', '1.67', '3.2', '2.4'))
Release_week = (c('201711', '201638', '201612', '201424', '201552', '201345',
'201719', '201647'))
DF2 = data.frame(Manual.Code, Version, Release_week, stringsAsFactors = FALSE)
Upvotes: 1
Reputation: 56159
Using dplyr, left_join, then arrange them by absolute (abs) difference between weeks, and get the first one (slice), finally assign Version to NA where Release_week is later than Week:
library(dplyr)
DF1 %>%
mutate(rn = row_number()) %>%
left_join(DF2, by = "Manual.Code") %>%
#if you have different column names use this:
#left_join(DF2, by = c("Manual" = "Manual.Code")) %>%
group_by(rn) %>%
arrange(rn, abs(Release_week - Week)) %>%
slice(1) %>%
ungroup() %>%
select(-rn) %>%
mutate(Version = ifelse(Release_week > Week, NA, Version))
# # A tibble: 7 x 5
# Type Week Manual.Code Version Release_week
# <chr> <int> <chr> <dbl> <int>
# 1 A 201651 M123 1.1 201638
# 2 A 201635 M467 1.6 201552
# 3 B 201714 M123 2 201711
# 4 C 201516 M846 NA 201719
# 5 D 201438 M898 NA 201647
# 6 A 201704 QQQQ NA NA
# 7 B 201651 M898 2.4 201647
Example input data:
DF1 <- read.table(text = "
Type Week Manual.Code
A 201651 M123
A 201635 M467
B 201714 M123
C 201516 M846
D 201438 M898
A 201704 QQQQ
B 201651 M898
", header = TRUE, stringsAsFactors = FALSE)
DF2 <- read.table(text = "
Manual.Code Version Release_week
M123 2.0 201711
M123 1.1 201638
M123 1.0 201612
CTTY 2.0 201424
M467 1.6 201552
M.99 1.67 201345
M846 3.2 201719
M898 2.4 201647
", header = TRUE, stringsAsFactors = FALSE)
Upvotes: 3
Reputation: 527
I think I found a solution to your problem. However I get a different output then you when just taking the preceding weeks. What I do is to look up the closest preceding week with a function and then make a LEFT JOIN with that column together with the manual.
options(stringsAsFactors = FALSE)
df1 = data.frame(type = c("A","A","B","C","D","A","B"),
week = c(201651, 201635, 201714, 201516, 201438, 201704, 201651),
manual.code = c("M123", "M467", "M123", "M846", "M898", "QQQQ", "M898"))
df2 = data.frame(manual.code = c("M123","M123","M123","CTTY","M467","M.99","M846", "M898"),
version = c("2.0","1.1","1.0", "2.0", "1.6", "1.67", "3.2", "2.4"),
release_week = c(201711,201638,201612,201424,201652,201345,201719,201647))
df3 = df1
df3$closest_week = NA
get_clostest_week = function(manual, week){
manual_weeks = df2[which(df2$manual.code == manual & df2$release_week <= week),"release_week"]
if(length(manual_weeks)==0){
return(NA)
}
diff = abs(manual_weeks - week)
manual_weeks[which(diff == min(diff))]
}
for(i in seq_len(nrow(df3))){
df3$closest_week[i] = get_clostest_week(df3$manual.code[i], df3$week[i])
}
df3 = merge(df3, df2, by.x = c("manual.code","closest_week"),
by.y = c("manual.code","release_week"), all.x = TRUE)))
(The answer has been edited because of misunderstanding of the problem.)
Upvotes: 0