conditional random matching from one data frame into another data frame

Question

I have two data frames. One data frame (Partners.Missing) contains 195 people who are partnered (married, de facto, etc.) and for whom I need to construct the partner, using a random selection from a second data frame (NAsOnly).

The Partners.Missing data frame information is:

 str(Partners.Missing)
 'data.frame':  195 obs. of  8 variables:
  $ V1         : Factor w/ 2 levels "Female","Male": 1 1 1 2 1 1 1 2 2 2 ...
  $ V2         : Factor w/ 9 levels "15 - 17 Years",..: 4 4 7 7 4 4 7 3 7 4 ...
  $ V3         : Factor w/ 1 level "Partnered": 1 1 1 1 1 1 1 1 1 1 ...
  $ V4         : Factor w/ 7 levels "Eight or More Usual Residents",..: 1 1 5 2 1 1 1 1 2 5 ...
  $ V5         : Factor w/ 8 levels "1-9 Hours Worked",..: 8 4 8 6 7 8 7 5 4 6 ...
  $ SEX        : chr  "Male" "Male" "Male" "Female" ...
  $ Ageband    : num  4 4 7 7 4 4 7 3 7 4 ...
  $ Inhabitants: num  8 8 6 5 8 8 8 8 5 6 ...

Because V2 is the age band stored as a factor, I have created the Ageband variable as a numeric recode of V2, so that the youngest age group (15 - 17 years) is 1, the next oldest is 2, etc. Inhabitants is a numeric recode of V4, constructed the same way. SEX is binary: "Male"/"Female".

The information on the second data frame (NAsOnly) is:

 str(NAsOnly)
 'data.frame':  762 obs. of  7 variables:
  $ SEX         : Factor w/ 3 levels "Female","Male",..: 2 2 2 2 2 2 2 2 2 2 ...
  $ AGEBAND     : Factor w/ 13 levels "0 - 4 Years",..: 3 3 3 3 3 3 3 3 3 3 ...
  $ RELATIONSHIP: Factor w/ 4 levels "Non-partnered",..: 3 3 3 3 1 1 1 1 1 1 ...
  $ INHABITANTS : Factor w/ 9 levels "Eight or More Usual Residents",..: 7 7 3 2 9 9 9 9 7 7 ...
  $ HRSWORKED   : Factor w/ 9 levels "1-9 Hours Worked",..: 1 8 6 3 1 2 3 6 3 4 ...

I can create new variables so that Ageband and Inhabitants in NAsOnly are the same construction, to use in matching. But I'm stuck on how to match. What I want to do - for each row in Partners.Missing - is to randomly sample an observation from NAsOnly using the following criteria:

opposite SEX (so a "Female" in Partners.Missing will match to a "Male" in NAsOnly)
the "Female" partner (irrespective of which data frame they originate from) is in the same age band as, or one age band younger than, the "Male" partner
the number of Inhabitants is an exact match, so that a "Female" from a 5-person household will only match to a "Male" (of the correct age band) from a 5-person household
RELATIONSHIP in NAsOnly can only be "Partnered" ("Non-partnered" and "Not elsewhere included" are also valid variable entries in that data frame)*.

So I want a one-to-one match, and I need the match to be a random draw and not the first available. This has to be done 195 times, once for each observation in Partners.Missing, so that their partner is no longer missing.

I can't use first or last match either, as there could be numerous rows in NAsOnly that match on the basis of my criteria. It has to be a random draw, otherwise the same observations will be drawn every time from NAsOnly. Basically, something like random sampling with replacement from NAsOnly. It does not matter whether the sampled observations are used to construct a third data frame of matches, or whether the sampled observations are added to Partners.Missing as additional columns.

*It has four levels as the original larger data frame had Totals rows, so the fourth (and unused) level is "Total".
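
To make the criteria above concrete, here is a minimal sketch on made-up toy data. It assumes NAsOnly already has numeric Ageband and Inhabitants columns recoded the same way as in Partners.Missing, and that SEX in Partners.Missing holds the sex of the partner to be constructed (as the str() output suggests). The helper name draw_partner and all toy values are hypothetical, not part of the real data.

```r
set.seed(1)  # only so the toy example is reproducible

# Toy stand-ins for the real data frames; every value here is made up.
Partners.Missing <- data.frame(
  V1          = c("Female", "Male", "Female"),  # sex of the existing person
  SEX         = c("Male", "Female", "Male"),    # sex of the partner to construct
  Ageband     = c(4, 7, 3),
  Inhabitants = c(8, 5, 6),
  stringsAsFactors = FALSE
)
NAsOnly <- data.frame(
  SEX          = c("Male", "Male", "Female", "Female", "Male", "Male"),
  Ageband      = c(4, 5, 7, 6, 3, 4),           # assumed numeric recode
  Inhabitants  = c(8, 8, 5, 5, 6, 6),           # assumed numeric recode
  RELATIONSHIP = c("Partnered", "Partnered", "Partnered",
                   "Partnered", "Non-partnered", "Partnered"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for one row `person`, return the index of one randomly
# drawn qualifying row in `pool`, or NA if no row qualifies.
draw_partner <- function(person, pool) {
  # The female is in the same age band as the male, or one band younger.
  age_ok <- if (person$V1 == "Female") {
    (pool$Ageband - person$Ageband) %in% c(0, 1)  # male candidate: same or one older
  } else {
    (person$Ageband - pool$Ageband) %in% c(0, 1)  # female candidate: same or one younger
  }
  ok <- pool$RELATIONSHIP == "Partnered" &
    pool$SEX == person$SEX &
    pool$Inhabitants == person$Inhabitants &
    age_ok
  idx <- which(ok)
  if (length(idx) == 0) return(NA_integer_)
  # sample.int() on the length avoids sample()'s behaviour when idx has length 1
  idx[sample.int(length(idx), 1)]
}

# One independent random draw per row of Partners.Missing
picked <- vapply(
  seq_len(nrow(Partners.Missing)),
  function(i) draw_partner(Partners.Missing[i, ], NAsOnly),
  integer(1)
)
matches <- NAsOnly[picked, ]  # third data frame of sampled partners
```

Because each draw is independent, the same NAsOnly row can be picked for more than one person (sampling with replacement). If a strictly one-to-one match is required instead, one option is to remove each picked row from the pool before the next draw.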

Update: I have tried to write a for loop to do this, but it's not working as intended. The code is:

 for(i in 1:1) {
   row <- Partners.Missing[i,]
   if(row$V1=="Female")
   matched <- data.frame(row$SEX[i]==Partnered.Censored$SEX &
             row$Inhabitants[i]==Partnered.Censored$Inhabitants &
           (row$Ageband[i]==Partnered.Censored$Ageband | row$Ageband[i]==Partnered.Censored$Ageband+1)
   )
   else
   matched <- data.frame(row$SEX[i]==Partnered.Censored$SEX &
           row$Inhabitants[i]==Partnered.Censored$Inhabitants &
           (row$Ageband[i]==Partnered.Censored$Ageband | row$Ageband[i]==Partnered.Censored$Ageband-1)
   )
 }

This outputs a data frame called matched with a single column of 277 TRUE/FALSE values, each indicating whether the row at that index in Partnered.Censored is a match or not. Once I increase i's maximum value to 2 (knowing I have 195 rows), I get NA as output. I have the following problems remaining:

I wish to use the row(s) that matches from Partnered.Censored rather than outputting a boolean result
I then wish to sample randomly from the matching rows to generate the new partner
and then repeat for each row in Partners.Missing.

I also have the problem where increasing the maximum value of i, e.g. to 2, overwrites the single column of TRUE/FALSE values with NA.
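
The NA output is consistent with how row is indexed inside the loop: row is already a single-row data frame, so row$SEX has one element and row$SEX[i] is NA for any i greater than 1. The small sketch below reproduces this on a made-up two-row data frame (df and its values are hypothetical). Note also that matched is reassigned on every iteration, so only the last iteration's result survives.

```r
# Made-up two-row data frame, only to reproduce the indexing behaviour.
df <- data.frame(
  SEX     = c("Male", "Female"),
  Ageband = c(4, 7),
  stringsAsFactors = FALSE
)

i <- 2
row <- df[i, ]                # a one-row data frame

first <- row$SEX              # "Female": the only element
second <- row$SEX[i]          # NA: a one-element vector has no 2nd element

# Any comparison against NA is NA, so the whole logical vector becomes NA.
cmp_bad <- row$SEX[i] == c("Male", "Female")   # NA NA
cmp_ok  <- row$SEX    == c("Male", "Female")   # FALSE TRUE
```

Dropping the [i] after row$... and using the resulting logical vector to subset, for example Partnered.Censored[cmp, ] with cmp being the combined condition, would return the matching rows themselves rather than a TRUE/FALSE column.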
