Reputation: 1671
In R, I have two data frames, A and B, as follows.
Data frame A:
Name Age City Gender Income Company ...
JXX 21 Chicago M 20K XYZ ...
CXX 25 NewYork M 30K PQR ...
CXX 26 Chicago M NA ZZZ ...
Data frame B:
Age City Gender Avg Income Avg Height Avg Weight ...
21 Chicago M 30K ... ... ...
25 NewYork M 40K ... ... ...
26 Chicago M 50K ... ... ...
I want to fill the missing values in data frame A from data frame B.
For example, for the third row in data frame A, I can substitute the average income from data frame B for the missing exact income. I don't want to merge the two data frames; instead, I want to perform a lookup-like operation using the Age, City and Gender columns.
Upvotes: 0
Views: 767
Reputation: 3
The following looks up each row's city in B and copies B's average income into A's Income column. Note that it matches on City only and replaces every Income value, not just the missing ones.
dataFrameA$Income = dataFrameB$`Avg Income`[match(dataFrameA$City, dataFrameB$City)]
You have to wrap a column name in backticks (`) if it contains a space.
This is similar to an INDEX/MATCH lookup in Excel, which I'm assuming you're coming from. The code will be more compact if you use data.table.
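A minimal sketch of the same match() idea, extended to all three key columns and restricted to the missing values, might look like the following. The sample data mirrors the question (with B's rows shuffled on purpose); the variable names keys, keyA, keyB, idx and na are made up for illustration, and the sketch assumes each Age/City/Gender combination appears at most once in B.

```r
# Sample data modelled on the question; B's rows are deliberately out of order
dataFrameA <- data.frame(
  Name = c("JXX", "CXX", "CXX"),
  Age = c(21, 25, 26),
  City = c("Chicago", "NewYork", "Chicago"),
  Gender = c("M", "M", "M"),
  Income = c("20K", "30K", NA),
  stringsAsFactors = FALSE
)
dataFrameB <- data.frame(
  Age = c(26, 21, 25),
  City = c("Chicago", "Chicago", "NewYork"),
  Gender = c("M", "M", "M"),
  `Avg Income` = c("50K", "30K", "40K"),
  check.names = FALSE,  # keep the space in the column name
  stringsAsFactors = FALSE
)

keys <- c("Age", "City", "Gender")
# Build one composite key string per row from the key columns
keyA <- do.call(paste, c(dataFrameA[keys], sep = "|"))
keyB <- do.call(paste, c(dataFrameB[keys], sep = "|"))
idx <- match(keyA, keyB)          # row of B for each row of A (NA if no match)
na <- is.na(dataFrameA$Income)    # only these rows are updated
dataFrameA$Income[na] <- dataFrameB$`Avg Income`[idx[na]]
dataFrameA
```

Rows of A that already have an Income are left untouched, and rows with no matching key in B simply stay NA.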
Upvotes: 0
Reputation: 35324
library(data.table);
## generate data
set.seed(5L);
NK <- 6L; pA <- 0.8; pB <- 0.2;
keydf <- unique(data.frame(Age=sample(18:65,NK,T),City=sample(c('Chicago','NewYork'),NK,T),Gender=sample(c('M','F'),NK,T),stringsAsFactors=F));
NO <- nrow(keydf)-1L;
Af <- cbind(keydf[-1L,],Name=sample(paste0(LETTERS,LETTERS,LETTERS),NO,T),Income=sample(c(NA,paste0(seq(20L,90L,10L),'K')),NO,T,c(pA,rep((1-pA)/8,8L))),stringsAsFactors=F)[sample(seq_len(NO)),];
Bf <- cbind(keydf[-2L,],`Avg Income`=sample(c(NA,paste0(seq(20L,90L,10L),'K')),NO,T,c(pB,rep((1-pB)/8,8L))),stringsAsFactors=F)[sample(seq_len(NO)),];
At <- as.data.table(Af);
Bt <- as.data.table(Bf);
At;
## Age City Gender Name Income
## 1: 50 NewYork F OOO NA
## 2: 23 Chicago M SSS NA
## 3: 62 NewYork M VVV NA
## 4: 51 Chicago F FFF 90K
## 5: 31 Chicago M XXX NA
Bt;
## Age City Gender Avg Income
## 1: 62 NewYork M NA
## 2: 51 Chicago F 60K
## 3: 31 Chicago M 50K
## 4: 27 NewYork M NA
## 5: 23 Chicago M 60K
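I generated some random test data for demonstration purposes. I'm quite happy with the result I got with seed 5, which covers many cases: a key in A with no match in B (50/NewYork/F), a key in B with no match in A (27/NewYork/M), an NA in A whose match in B is also NA (62/NewYork/M), NAs in A that can be filled from B (23/Chicago/M and 31/Chicago/M), and a non-NA value in A that should be left alone (51/Chicago/F).
I also intentionally scrambled the rows of A and B to make sure the join is correct regardless of the incoming row order.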
## data.table solution
keys <- c('Age','City','Gender');
At[is.na(Income),Income:=Bt[.SD,on=keys,`Avg Income`]];
At;
## Age City Gender Name Income
## 1: 50 NewYork F OOO NA
## 2: 23 Chicago M SSS 60K
## 3: 62 NewYork M VVV NA
## 4: 51 Chicago F FFF 90K
## 5: 31 Chicago M XXX 50K
In the above, I filter for NA values in A first, then do a join in the j argument on the key columns, and assign the source column to the target column in place using the data.table := syntax.
Note that in the data.table world, X[Y] does a right join, so if you want a left join you need to reverse it to Y[X] (with "left" now referring to X, counter-intuitively). That's why I used Bt[.SD] instead of the likely more natural expectation of .SD[Bt]. We need a left join on .SD because the result of the join index expression will be assigned in place to the target column, so the RHS of the assignment must be a full vector corresponding to the target column.
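To see the join direction concretely, here is a small sketch with made-up tables X and Y (not the At and Bt from this answer; the column names id, xval and yval are invented). It shows that X[Y, on=...] returns one row per row of Y, in Y's order, with NA where X has no match.

```r
library(data.table)

# Hypothetical toy tables, for illustration only
X <- data.table(id = c(1L, 2L, 3L), xval = c("a", "b", "c"))
Y <- data.table(id = c(3L, 4L, 1L), yval = c("p", "q", "r"))

# X[Y]: every row of Y is kept, in Y's order; unmatched ids get NA for xval
XY <- X[Y, on = "id"]
print(XY)

# Y[X]: every row of X is kept, in X's order; unmatched ids get NA for yval
YX <- Y[X, on = "id"]
print(YX)

# Selecting a column of X in j yields a vector aligned with the rows of Y
x_for_y <- X[Y, on = "id", xval]
print(x_for_y)
```

Because the vector in the last step has exactly one element per row of Y, it can be assigned directly onto those rows, which is the property the in-place update relies on.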
You can repeat the in-place assignment line for each column you want to replace.
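Rather than copying the line once per column, the same update could be driven by a small table of target-to-source column names. This is only a sketch: colmap, the Height and Avg Height columns, and the sample values are invented for illustration, it uses data.table's set() instead of the := form shown above, and it assumes each key combination is unique in B.

```r
library(data.table)

keys <- c("Age", "City", "Gender")
# Hypothetical sample data; Height / Avg Height are invented for illustration
At <- data.table(Age = c(21L, 26L, 25L), City = c("Chicago", "Chicago", "NewYork"),
                 Gender = "M", Income = c("20K", NA, NA), Height = c(NA, 180, 175))
Bt <- data.table(Age = c(26L, 21L), City = "Chicago", Gender = "M",
                 `Avg Income` = c("50K", "30K"), `Avg Height` = c(178, 172))

# Names are target columns in At, values are the source columns in Bt
colmap <- c(Income = "Avg Income", Height = "Avg Height")

for (target in names(colmap)) {
  rows <- which(is.na(At[[target]]))   # rows of At that need filling
  if (length(rows) == 0L) next
  # Bt[At[rows]] keeps one row per selected row of At (NA where no match)
  filled <- Bt[At[rows], on = keys][[colmap[[target]]]]
  set(At, i = rows, j = target, value = filled)  # update by reference
}
print(At)
```

The 25/NewYork/M row has no counterpart in this toy Bt, so its Income stays NA, just like the unmatched row in the example above.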
## base R solution
keys <- c('Age','City','Gender');
m <- merge(cbind(Af[keys],Ai=seq_len(nrow(Af))),cbind(Bf[keys],Bi=seq_len(nrow(Bf))))[c('Ai','Bi')];
m;
## Ai Bi
## 1 2 5
## 2 5 3
## 3 4 2
## 4 3 1
mi <- which(is.na(Af$Income[m$Ai]));
Af$Income[m$Ai[mi]] <- Bf$`Avg Income`[m$Bi[mi]];
Af;
## Age City Gender Name Income
## 2 50 NewYork F OOO <NA>
## 5 23 Chicago M SSS 60K
## 3 62 NewYork M VVV <NA>
## 6 51 Chicago F FFF 90K
## 4 31 Chicago M XXX 50K
I guess I was feeling a little bit creative here, so for a base R solution I did something that's probably a little unusual, and which I've never done before. I column-bound a synthesized row index column onto the key-column subset of each of the A and B data.frames, then called merge() to join them (note that this is an inner join, since we don't need any kind of outer join here), and extracted just the row index columns that resulted from the join. This effectively precomputes the joined pairs of rows for all subsequent modification operations.
For the modification, I precompute the subset of the join pairs for which the row in A satisfies the replacement condition, e.g. that its Income value is NA for the Income replacement. We can then subset the join pair table for those rows and do a direct assignment from B to A to carry out the replacement.
As before, you can repeat the assignment line for every column you want to replace.
Upvotes: 1
Reputation: 2489
I think this works for Income. If there are only those three columns, you could substitute the names of the other columns in the same way:
df1<-read.table(header = T, stringsAsFactors = F, text = "
Name Age City Gender Income Company
JXX 21 Chicago M 20K XYZ
CXX 25 NewYork M 30K PQR
CXX 26 Chicago M NA ZZZ")
df2<-read.table(header = T, stringsAsFactors = F, text = "
Age City Gender Avg_Income
21 Chicago M 30K
25 NewYork M 40K
26 Chicago M 50K ")
df1[is.na(df1$Income),]$Income<-df2[is.na(df1$Income),]$Avg_Income
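Note that this pairs rows by position rather than looking them up by Age, City and Gender, so it relies on df1 and df2 having the same rows in the same order, as they do here. It wouldn't surprise me if one of the regulars has a better way that saves you from having to retype the column names.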
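One way to avoid retyping column names, and to stop depending on df1 and df2 having their rows in the same order, is to wrap the lookup in a small function. The sketch below is an illustration under assumptions: fill_from_lookup and its arguments are hypothetical names rather than an existing API, and it assumes each key combination occurs at most once in the lookup table.

```r
# Hypothetical helper: fill NAs in `target` from `lookup`, matching rows on `keys`.
# `colmap` is a named character vector: names are columns in `target`,
# values are the corresponding columns in `lookup`.
fill_from_lookup <- function(target, lookup, keys, colmap) {
  key_t <- do.call(paste, c(target[keys], sep = "|"))
  key_l <- do.call(paste, c(lookup[keys], sep = "|"))
  idx <- match(key_t, key_l)  # row of lookup for each row of target, NA if none
  for (col in names(colmap)) {
    na <- is.na(target[[col]])
    target[[col]][na] <- lookup[[colmap[[col]]]][idx[na]]
  }
  target
}

df1 <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Name Age City Gender Income Company
JXX 21 Chicago M 20K XYZ
CXX 25 NewYork M 30K PQR
CXX 26 Chicago M NA ZZZ")
# Rows deliberately in a different order than df1
df2 <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Age City Gender Avg_Income
26 Chicago M 50K
21 Chicago M 30K
25 NewYork M 40K")

df1 <- fill_from_lookup(df1, df2, c("Age", "City", "Gender"),
                        c(Income = "Avg_Income"))
df1
```

Adding more columns is then a matter of extending the colmap vector instead of repeating the assignment line.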
Upvotes: 1