Kenji
Kenji

Reputation: 581

How to attach variables of one observation to another within the same dataset in R

My problem is as follows:

Suppose that I have a dataset of person-years with information on marital status (cStatus), race, year, spouse id (pID) and city of residence:

data<-data.frame(cbind(c(rep(1,5),rep(2,5),rep(3,5),rep(4,5),rep(5,5)),c(rep(c(1,2,3,4,5),5)),c(NA,NA,NA,NA,NA,NA,NA,3,3,NA,NA,NA,2,2,7,6,6,6,6,6,NA,NA,NA,NA,NA),c(0,0,0,0,0,0,0,1,1,0,0,0,1,1,1,1,1,1,1,1,0,0,0,0,0),c(1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1),c(rep(1,5),rep(1,2),rep(2,3),rep(2,4),1,rep(1,5),rep(1,5))))
names(data)<- c("id","year","pID","cStatus","race","city")

> head(data)  
id year pID cStatus race city
 1    1    NA       0    1    1
 1    2    NA       0    1    1
 1    3    NA       0    1    1
 1    4    NA       0    1    1
 1    5    NA       0    1    1
 2    1    NA       0    1    1

I want to create a variable for spouse's race (pRace) for each observation. I am trying to do it for a given observation "A" by searching in the dataset for other observation "B" which "id" has the same value as "pID" of observation "A" and assigning "B"s "race" to "A"s pRace.

I thought about looping:

for(i in unique(data$id)){
  data[data$id==i,]$pRace <-ifelse(data[data$id==i,]$cStatus==1,data[data$id==data[data$id==i,]$pID,]$race,NA)
}

This, besides being very ugly, yields:

Warning messages:
1: In `[<-.data.frame`(`*tmp*`, data$id == i, , value = list(id = c(1,  :
  provided 8 variables to replace 7 variables

Is there any way of doing it right? Am I in the right way by trying to use loops?

Upvotes: 1

Views: 2242

Answers (1)

Andy
Andy

Reputation: 4659

# create a dataframe that has unique entries for each person and their race
spouses <- unique(data[c("id", "race")])
names(spouses) <- c("pID", "pRace")

# merge race via spouse id
data <- merge(data, spouses, by="pID", all.x=TRUE)

Which gives:

> data
   pID id year cStatus race city pRace
1    2  3    4       1    0    2     1
2    2  3    3       1    0    2     1
3    3  2    4       1    1    2     0
4    3  2    3       1    1    2     0
5    6  4    2       1    0    1    NA
6    6  4    1       1    0    1    NA
7    6  4    3       1    0    1    NA
8    6  4    5       1    0    1    NA
9    6  4    4       1    0    1    NA
10   7  3    5       1    0    1    NA
11  NA  1    1       0    1    1    NA
12  NA  1    2       0    1    1    NA
[...]

Notice you get NA's where either the pID doesn't itself exist in the data (no corresponding id) or where there is no spouse.

If you wanted to account for changing spouses over years, just add years to both the helper dataframe as well as the merge arguments. Like @joran points out, merge can accept multiple columns to merge on, similar to SQL.

# create a dataframe that has unique entries for each person and their race
spouses.yearly <- unique(data[c("id", "year", "race")])
names(spouses.yearly) <- c("pID", "pRace")

# merge race via spouse id
data <- merge(data, spouses.yearly, by=c("pID", "year"), all.x=TRUE)

Upvotes: 3

Related Questions