Reputation: 1363
I have two data frames. One data frame (`Partners.Missing`) contains 195 people who are partnered (married, de facto, etc.) and for whom I need to construct the partner, using a random selection from a second data frame (`NAsOnly`).
The `Partners.Missing` data frame information is:
str(Partners.Missing)
'data.frame': 195 obs. of 8 variables:
$ V1 : Factor w/ 2 levels "Female","Male": 1 1 1 2 1 1 1 2 2 2 ...
$ V2 : Factor w/ 9 levels "15 - 17 Years",..: 4 4 7 7 4 4 7 3 7 4 ...
$ V3 : Factor w/ 1 level "Partnered": 1 1 1 1 1 1 1 1 1 1 ...
$ V4 : Factor w/ 7 levels "Eight or More Usual Residents",..: 1 1 5 2 1 1 1 1 2 5 ...
$ V5 : Factor w/ 8 levels "1-9 Hours Worked",..: 8 4 8 6 7 8 7 5 4 6 ...
$ SEX : chr "Male" "Male" "Male" "Female" ...
$ Ageband : num 4 4 7 7 4 4 7 3 7 4 ...
$ Inhabitants: num 8 8 6 5 8 8 8 8 5 6 ...
Because `V2` is the age band stored as a factor, I have created the `Ageband` variable as a recode of `V2`, so that the youngest age group (15 - 17 years) is 1, the next oldest is 2, etc. `Inhabitants` is a recode of `V4`, again to construct a numeric variable. `SEX` is binary "Male"/"Female".
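A minimal sketch of that kind of recode, on a toy factor. Only "15 - 17 Years" is a label taken from the output above; the other labels, the `toy` data frame and the `age_levels` vector are made up for illustration.

```r
# Toy data: only "15 - 17 Years" is a label from the real data;
# the other labels are hypothetical stand-ins.
age_levels <- c("15 - 17 Years", "18 - 24 Years", "25 - 34 Years")
toy <- data.frame(
  V2 = factor(c("25 - 34 Years", "15 - 17 Years", "18 - 24 Years"),
              levels = age_levels)
)

# as.integer() on a factor returns the underlying level codes,
# so the youngest band becomes 1, the next 2, and so on.
toy$Ageband <- as.integer(toy$V2)

# Matching on the labels is safer when another data frame stores
# the same variable with a different (longer) set of levels.
toy$Ageband2 <- match(as.character(toy$V2), age_levels)
```

The label-based version is the one that carries over to a second data frame whose factor has extra levels, since level codes there would start from a different band.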
The information on the second data frame (`NAsOnly`) is:
str(NAsOnly)
'data.frame': 762 obs. of 7 variables:
$ SEX : Factor w/ 3 levels "Female","Male",..: 2 2 2 2 2 2 2 2 2 2 ...
$ AGEBAND : Factor w/ 13 levels "0 - 4 Years",..: 3 3 3 3 3 3 3 3 3 3 ...
$ RELATIONSHIP: Factor w/ 4 levels "Non-partnered",..: 3 3 3 3 1 1 1 1 1 1 ...
$ INHABITANTS : Factor w/ 9 levels "Eight or More Usual Residents",..: 7 7 3 2 9 9 9 9 7 7 ...
$ HRSWORKED : Factor w/ 9 levels "1-9 Hours Worked",..: 1 8 6 3 1 2 3 6 3 4 ...
I can create new variables so that `Ageband` and `Inhabitants` in `NAsOnly` have the same construction, to use in matching. But I'm stuck on how to match. What I want to do, for each row in `Partners.Missing`, is to randomly sample an observation from `NAsOnly` using the following criteria:
- `SEX` (so a "Female" in `Partners.Missing` will match to a "Male" in `NAsOnly`)
- `Inhabitants` is an exact match, so that a "Female" from a 5-person household will only match to a "Male" (of the correct age band) from a 5-person household
- `RELATIONSHIP` in `NAsOnly` can only be "Partnered" ("Non-partnered" and "Not elsewhere included" are also valid variable entries in that data frame)*.

So I want a one-to-one match, and I need the match to be a random draw and not the first available. I need to do this 195 times, once for each observation in `Partners.Missing`, so that their partner is no longer missing.
I can't use the first or last match either, as there could be numerous rows in `NAsOnly` that match on the basis of my criteria. It has to be a random draw, otherwise the same observations will be drawn every time from `NAsOnly`. Basically, something like random sampling with replacement from `NAsOnly`. It does not matter whether the sampled observations are used to construct a third data frame of matches, or whether the sampled observations are added to `Partners.Missing` as additional columns.
*It has four levels as the original larger data frame had Totals rows, so the fourth (and unused) level is "Total".
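The random draw itself can be sketched on toy data as below. The `pool` data frame and its values are hypothetical; only the column names follow the question. The sketch draws a position within the candidate indices instead of calling `sample()` on the indices directly, because `sample()` treats a single number `n` as `1:n`.

```r
set.seed(1)  # only to make the toy example repeatable

# Hypothetical candidate pool; column names follow the question.
pool <- data.frame(SEX = c("Male", "Male", "Female", "Male"),
                   Inhabitants = c(5, 5, 5, 8),
                   stringsAsFactors = FALSE)

# Row indices of every candidate meeting the criteria.
idx <- which(pool$SEX == "Male" & pool$Inhabitants == 5)

# Draw a position within idx rather than calling sample(idx, 1):
# with a lone candidate at, say, row 4, sample(4, 1) would draw
# from rows 1 to 4 instead of returning row 4.
pick <- idx[sample.int(length(idx), 1)]
chosen <- pool[pick, ]
```

Because each person gets an independent draw, the same candidate can be picked for more than one person, which is the "with replacement" behaviour described above.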
Update: I have tried to write a `for` loop to do this, but it's not working as intended. The code is:
for(i in 1:1) {
  row <- Partners.Missing[i,]
  if(row$V1=="Female")
    matched <- data.frame(row$SEX[i]==Partnered.Censored$SEX &
                          row$Inhabitants[i]==Partnered.Censored$Inhabitants &
                          (row$Ageband[i]==Partnered.Censored$Ageband | row$Ageband[i]==Partnered.Censored$Ageband+1)
    )
  else
    matched <- data.frame(row$SEX[i]==Partnered.Censored$SEX &
                          row$Inhabitants[i]==Partnered.Censored$Inhabitants &
                          (row$Ageband[i]==Partnered.Censored$Ageband | row$Ageband[i]==Partnered.Censored$Ageband-1)
    )
}
This outputs a data frame called `matched` with a single column of 277 `TRUE`/`FALSE` values, each representing whether the row at that index in `Partnered.Censored` is a match or not. Once I increase `i`'s maximum value to 2 (knowing I have 195 rows), I get `NA` as output. I have the following problems remaining:
- `Partnered.Censored` rather than outputting a boolean result
- `Partners.Missing`.

I also have the problem where increasing the maximum value of `i`, e.g. to 2, overwrites the single column of `TRUE`/`FALSE` values with `NA`.
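The `NA` values appear to come from indexing twice. A small sketch on toy data (the `toy` data frame is hypothetical): `row` is already a one-row data frame, so `row$SEX[i]` asks for element `i` of a length-one vector, and any `i` greater than 1 returns `NA`.

```r
toy <- data.frame(SEX = c("Male", "Female", "Male"),
                  Inhabitants = c(8, 5, 6),
                  stringsAsFactors = FALSE)

i <- 2
row <- toy[i, ]          # one-row data frame holding row 2

first_try <- row$SEX[i]  # element 2 of a length-1 vector: NA
fixed     <- row$SEX     # the value itself: "Female"

# Comparing NA against a column yields NA everywhere,
# which matches the all-NA output described above.
cmp_bad  <- first_try == toy$SEX
cmp_good <- fixed == toy$SEX
```

With `i` equal to 1 the two forms happen to agree, which would explain why the loop only looked correct for the first row.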
Upvotes: 1
Views: 370
Reputation: 1363
This has been top of my mind for the past couple of days, and I appear to have solved the problem using the following code. I'm leaving the question and answer up just in case anyone else needs to do this.
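# Build one randomly chosen partner for each person in Partners.Missing
for(i in 1:nrow(Partners.Missing)) {
  row <- Partners.Missing[i,]
  # One-to-many join on sex and household size
  result <- merge(row, Partnered.Censored, by=c("SEX","Inhabitants"),suffixes=c(".r",".c"))
  # Keep only candidates in an acceptable age band
  if (row$V1=="Female") {
    result <- subset(result, Ageband.r==Ageband.c | Ageband.r==Ageband.c-1)
  }
  if (row$V1=="Male") {
    result <- subset(result, Ageband.r==Ageband.c | Ageband.r==Ageband.c+1)
  }
  # Random draw among the candidates; assumes at least one row matched,
  # because 1:nrow(result) is c(1, 0) when result is empty
  j <- sample(1:nrow(result),1)
  # First pass creates the output; later passes append to it
  if(i == 1) {
    Matched.Partners <- result[j,]
  }
  if (i > 1) {
    Matched.Partners <- rbind(Matched.Partners,result[j,])
  }
}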
To explain this code for anyone who needs this answer too, and also to see whether the community has a better answer:
For each person in `Partners.Missing`, a temporary one-row data frame is created holding that person's information. A one-to-many join is constructed on the basis of the two variables that will match: the missing person's sex, and the number of inhabitants in the household. Then, depending on whether the person in `Partners.Missing` is female or male, the matched results are only retained for potential partners with the correct age band. The code then counts the potential partners identified and generates a random integer between 1 and that number. This is used to extract the randomly matched person and put them into the output data frame. Because the output data frame (`Matched.Partners`) does not exist before this code is run, the first pass through the loop creates the data frame with its first row. Every other time through, the data frame already exists, so the new match is appended.
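One possible variation, sketched on toy data rather than the real data frames: collect each match in a list and bind once at the end, which avoids the special case for the first pass and skips anyone with no candidates. The `missing` and `pool` data frames and the `pick_partner` helper are hypothetical names for illustration; the join and age rule copy the loop above.

```r
set.seed(42)  # only so the toy run is repeatable

# Hypothetical stand-ins for the two data frames in the answer.
missing <- data.frame(V1 = c("Female", "Male"),
                      SEX = c("Male", "Female"),  # sex used for the join
                      Inhabitants = c(5, 8),
                      Ageband = c(4, 7),
                      stringsAsFactors = FALSE)
pool <- data.frame(SEX = c("Male", "Male", "Female", "Female"),
                   Inhabitants = c(5, 5, 8, 8),
                   Ageband = c(4, 5, 7, 6),
                   stringsAsFactors = FALSE)

pick_partner <- function(i) {
  row <- missing[i, ]
  cand <- merge(row, pool, by = c("SEX", "Inhabitants"),
                suffixes = c(".r", ".c"))
  # Same age rule as the loop above.
  shift <- if (row$V1 == "Female") -1 else 1
  cand <- cand[cand$Ageband.r == cand$Ageband.c |
               cand$Ageband.r == cand$Ageband.c + shift, ]
  if (nrow(cand) == 0) return(NULL)  # no candidate: skip this person
  cand[sample.int(nrow(cand), 1), ]
}

# Collect one result per person, then bind once at the end.
matches <- lapply(seq_len(nrow(missing)), pick_partner)
Matched.Partners <- do.call(rbind, matches)
```

Whether skipping unmatched people is acceptable depends on the analysis; the loop above would instead need every person to have at least one candidate.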
I'll not vote up either my question or my answer.
Upvotes: 0