Sharath
Sharath

Reputation: 2267

Replace wrong values in df2 with true values in df1 by using 2 common columns in R

I have 2 data frames like this

TEAM <- c("PE","PE","MPI","TDT","HPT")
EmpID <- c (444452,444456,16822,339862,14828)    
ManagerID <- c(11499,11599,11899,11339,11559)
CODE <- c("F",NA,"A","H","G")
df1 <- data.frame(TEAM,EmpID,ManagerID,CODE)

TEAM <- c("MPI","TDT","HPT","PE","TDT","PE","MPI","TDT","HPT","PE")
EmpID <- c(444452,444452,444452,339862,339862,16822,339862,16822,14828,14828)
ManagerID <- c(11499,11499,11499,11339,11339,11899,11339,11899,11559,11559)
CODE <- c("A234","H665","G654","F616","H626","F234","H695","G954","G616",NA)
df2 <- data.frame(TEAM,EmpID,ManagerID,CODE)

I am trying to update the wrong values of ManagerID & EmpID in df2 with the true values of ManagerID & EmpID in df1 only when the TEAM & the CODE (matching the letter in CODE column in df1 with the first letter of CODE column in df2). If the team matches but the code is not correct, then the wrong values stay and shouldn't be replaced with the values from df1.

My desired output is

   TEAM  EmpID ManagerID CODE
1   MPI  16822     11899 A234
2   TDT 339862     11339 H665
3   HPT  14828     11559 G654
4    PE 444452     11499 F616
5   TDT 339862     11339 H626
6    PE 444452     11499 F234
7   MPI 339862     11339 H695
8   TDT  16822     11899 G954
9   HPT  14828     11559 G616
10   PE 444452     11599 <NA>

You can see that the row 7 & 8 remain unchanged since the codes don't match.

I tried doing it this way with the help from Gregor for my previous question

df2$ManagerID = df1$ManagerID[match(substr(df2$CODE, 1, 1), df1$CODE)]
df2$EmpID = df1$EmpID [match(substr(df2$CODE, 1, 1), df1$CODE)]

I am not sure if I am headed in the right direction. Kindly help me with inputs on how efficiently to solve this.

Upvotes: 3

Views: 85

Answers (1)

Chris
Chris

Reputation: 6372

Since you are doing text subsetting here, I would initialize as character vectors, not factors:

df1 <- data.frame(TEAM,EmpID,ManagerID,CODE, stringsAsFactors = FALSE)
df2 <- data.frame(TEAM,EmpID,ManagerID,CODE, stringsAsFactors = FALSE)

I would then use data table (not necessary, but cleaner):

library(data.table)

# convert data frames to data.table
setDT(df1)
setDT(df2)

You want to create a column with the value to merge here, just taking the first character of df2's CODE

df2[ , C_SHORT := substr(CODE,1,1)]

We then merge the two data frames on the TEAM/CODE combination. This creates NAs where any matches do not exist. Then, test if any of the columns are NA, and insert the initial value if they are

merge(x = df2,y = df1, by.x = c("TEAM","C_SHORT"), by.y = c("TEAM","CODE"), all.x = TRUE)[
  ,
  .(
    TEAM,
    EmpID = ifelse(is.na(EmpID.y), EmpID.x, EmpID.y), 
    ManagerID = ifelse(is.na(ManagerID.y), ManagerID.x, ManagerID.y),
    CODE
  )
  ]
    TEAM  EmpID ManagerID CODE
 1:  HPT  14828     11559 G654
 2:  HPT  14828     11559 G616
 3:  MPI  16822     11899 A234
 4:  MPI 339862     11339 H695
 5:   PE 444456     11599   NA
 6:   PE 444452     11499 F616
 7:   PE 444452     11499 F234
 8:  TDT  16822     11899 G954
 9:  TDT 339862     11339 H665
10:  TDT 339862     11339 H626

One note: You are using NA here as a lookup here. This works in the merge function (which I did not realize), but IMO this is bad practice (NA in R really refers to missing data, whereas here it encodes something). I would think about changing how this is represented in your data.

Upvotes: 1

Related Questions