Reputation: 2267
I have 2 data frames like this
TEAM <- c("PE","PE","MPI","TDT","HPT")
EmpID <- c (444452,444456,16822,339862,14828)
ManagerID <- c(11499,11599,11899,11339,11559)
CODE <- c("F",NA,"A","H","G")
df1 <- data.frame(TEAM,EmpID,ManagerID,CODE)
TEAM <- c("MPI","TDT","HPT","PE","TDT","PE","MPI","TDT","HPT","PE")
EmpID <- c(444452,444452,444452,339862,339862,16822,339862,16822,14828,14828)
ManagerID <- c(11499,11499,11499,11339,11339,11899,11339,11899,11559,11559)
CODE <- c("A234","H665","G654","F616","H626","F234","H695","G954","G616",NA)
df2 <- data.frame(TEAM,EmpID,ManagerID,CODE)
I am trying to update the wrong values of ManagerID & EmpID in df2 with the true values of ManagerID & EmpID in df1 only when the TEAM & the CODE (matching the letter in CODE column in df1 with the first letter of CODE column in df2). If the team matches but the code is not correct, then the wrong values stay and shouldn't be replaced with the values from df1.
My desired output is
TEAM EmpID ManagerID CODE
1 MPI 16822 11899 A234
2 TDT 339862 11339 H665
3 HPT 14828 11559 G654
4 PE 444452 11499 F616
5 TDT 339862 11339 H626
6 PE 444452 11499 F234
7 MPI 339862 11339 H695
8 TDT 16822 11899 G954
9 HPT 14828 11559 G616
10 PE 444452 11599 <NA>
You can see that the row 7 & 8 remain unchanged since the codes don't match.
I tried doing it this way with the help from Gregor for my previous question
df2$ManagerID = df1$ManagerID[match(substr(df2$CODE, 1, 1), df1$CODE)]
df2$EmpID = df1$EmpID [match(substr(df2$CODE, 1, 1), df1$CODE)]
I am not sure if I am headed in the right direction. Kindly help me with inputs on how efficiently to solve this.
Upvotes: 3
Views: 85
Reputation: 6372
Since you are doing text subsetting here, I would initialize as character vectors, not factors:
df1 <- data.frame(TEAM,EmpID,ManagerID,CODE, stringsAsFactors = FALSE)
df2 <- data.frame(TEAM,EmpID,ManagerID,CODE, stringsAsFactors = FALSE)
I would then use data table (not necessary, but cleaner):
library(data.table)
# convert data frames to data.table
setDT(df1)
setDT(df2)
You want to create a column with the value to merge here, just taking the first character of df2's CODE
df2[ , C_SHORT := substr(CODE,1,1)]
We then merge the two data frames on the TEAM/CODE
combination. This creates NA
s where any matches do not exist. Then, test if any of the columns are NA
, and insert the initial value if they are
merge(x = df2,y = df1, by.x = c("TEAM","C_SHORT"), by.y = c("TEAM","CODE"), all.x = TRUE)[
,
.(
TEAM,
EmpID = ifelse(is.na(EmpID.y), EmpID.x, EmpID.y),
ManagerID = ifelse(is.na(ManagerID.y), ManagerID.x, ManagerID.y),
CODE
)
]
TEAM EmpID ManagerID CODE
1: HPT 14828 11559 G654
2: HPT 14828 11559 G616
3: MPI 16822 11899 A234
4: MPI 339862 11339 H695
5: PE 444456 11599 NA
6: PE 444452 11499 F616
7: PE 444452 11499 F234
8: TDT 16822 11899 G954
9: TDT 339862 11339 H665
10: TDT 339862 11339 H626
One note: You are using NA
here as a lookup here. This works in the merge function (which I did not realize), but IMO this is bad practice (NA
in R really refers to missing data, whereas here it encodes something). I would think about changing how this is represented in your data.
Upvotes: 1