Reputation: 581
My problem is as follows:
Suppose that I have a dataset of person-years with information on marital status (cStatus), race, year, spouse id (pID) and city of residence:
data<-data.frame(cbind(c(rep(1,5),rep(2,5),rep(3,5),rep(4,5),rep(5,5)),c(rep(c(1,2,3,4,5),5)),c(NA,NA,NA,NA,NA,NA,NA,3,3,NA,NA,NA,2,2,7,6,6,6,6,6,NA,NA,NA,NA,NA),c(0,0,0,0,0,0,0,1,1,0,0,0,1,1,1,1,1,1,1,1,0,0,0,0,0),c(1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1),c(rep(1,5),rep(1,2),rep(2,3),rep(2,4),1,rep(1,5),rep(1,5))))
names(data)<- c("id","year","pID","cStatus","race","city")
> head(data)
id year pID cStatus race city
1 1 NA 0 1 1
1 2 NA 0 1 1
1 3 NA 0 1 1
1 4 NA 0 1 1
1 5 NA 0 1 1
2 1 NA 0 1 1
I want to create a variable for spouse's race (pRace) for each observation. I am trying to do it for a given observation "A" by searching in the dataset for other observation "B" which "id" has the same value as "pID" of observation "A" and assigning "B"s "race" to "A"s pRace.
I thought about looping:
for(i in unique(data$id)){
data[data$id==i,]$pRace <-ifelse(data[data$id==i,]$cStatus==1,data[data$id==data[data$id==i,]$pID,]$race,NA)
}
This, besides being very ugly, yields:
Warning messages:
1: In `[<-.data.frame`(`*tmp*`, data$id == i, , value = list(id = c(1, :
provided 8 variables to replace 7 variables
Is there any way of doing it right? Am I in the right way by trying to use loops?
Upvotes: 1
Views: 2242
Reputation: 4659
# create a dataframe that has unique entries for each person and their race
spouses <- unique(data[c("id", "race")])
names(spouses) <- c("pID", "pRace")
# merge race via spouse id
data <- merge(data, spouses, by="pID", all.x=TRUE)
Which gives:
> data
pID id year cStatus race city pRace
1 2 3 4 1 0 2 1
2 2 3 3 1 0 2 1
3 3 2 4 1 1 2 0
4 3 2 3 1 1 2 0
5 6 4 2 1 0 1 NA
6 6 4 1 1 0 1 NA
7 6 4 3 1 0 1 NA
8 6 4 5 1 0 1 NA
9 6 4 4 1 0 1 NA
10 7 3 5 1 0 1 NA
11 NA 1 1 0 1 1 NA
12 NA 1 2 0 1 1 NA
[...]
Notice you get NA's where either the pID
doesn't itself exist in the data (no corresponding id
) or where there is no spouse.
If you wanted to account for changing spouses over years, just add years to both the helper dataframe as well as the merge
arguments. Like @joran points out, merge
can accept multiple columns to merge on, similar to SQL.
# create a dataframe that has unique entries for each person and their race
spouses.yearly <- unique(data[c("id", "year", "race")])
names(spouses.yearly) <- c("pID", "pRace")
# merge race via spouse id
data <- merge(data, spouses.yearly, by=c("pID", "year"), all.x=TRUE)
Upvotes: 3