Check if each value of one variable matches any value of another variable

Question

I would like something like egen newvar = anymatch(oldvar1), values(oldvar2). This would create an indicator variable that shows if a particular value in oldvar2 is contained anywhere in oldvar1. Except AFAIK for anymatch the values only accepts integer numlist.

This would have many uses, but here's my first example: I have a large country pair dataset. I have a subset of Country1 with a particular property. I want to identify the Country2's that match any member of the subset.

Let's see if I can produce a toy example:

     +--------------------------------------------------+
     | pair   Country1   value_C1   Country2   valC1_g2 |
     |--------------------------------------------------|
  1. |   AB          A          1          B          0 |
  2. |   AC          A          2          C          0 |
  3. |   BA          B          3          A          1 |
  4. |   BC          B          4          C          1 |
  5. |   CA          C          5          A          1 |
     |--------------------------------------------------|
  6. |   CB          C          6          B          1 |
     +--------------------------------------------------+

The value_C1 variable relates to properties of Country1. The variable valC1_g2 indicates that value_C1 is > 2. I would like a variable that indicates if each specific variable in Country2 has a match somewhere in the list of Country1 if valC1_g2.

This seems like it would not be a rare problem, but I couldn't find anything directly addressing it except this researchgate topic.

This can be done with merge:

preserve // saves current data
tempfile localdata // initialize a temporary new dataset
  keep Country1 valC1_g2 // subsetting allows re-ordering of just these variables
  rename (Country1 valC1_g2) (Country2 valC2_g2) // renaming to match the target
  sort Country2 // this just facilitates the 1:1 merge
  save `localdata'
restore // bring back original data

sort Country2 
merge  Country2 using `localdata' // this re-orders the subset to align with Country2 
  sort Country1 pair  // this resets the dataset to the original order

list, abbreviate(10) separator(0)

     +-------------------------------------------------------------+
     | pair   value_C1   Country1   Country2   valC1_g2   valC2_g2 |
     |-------------------------------------------------------------|
  1. |   AB          1          A          B          0          1 |
  2. |   AC          2          A          C          0          1 |
  3. |   BA          3          B          A          1          0 |
  4. |   BC          4          B          C          1          1 |
  5. |   CA          5          C          A          1          0 |
  6. |   CB          6          C          B          1          1 |
     +-------------------------------------------------------------+

I see a couple of problems with this approach.

First, you'll see my merge syntax uses the old syntax because Country2 does not uniquely identify my data. (I guess that means I'm doing m:m in the new syntax? The documentation of merge says that's a no-no.) The pair variable plays that role, but I can't merge by pair or I won't get the re-ordering I need. Maybe I'm worrying too much?

Second, how do I check if it worked? Afterwards I would like to compare the country IDs in Country1 and Country2 to see if they have the same set of values. compare won't work because the values aren't lined up row-wise. The only thing I can think to do is compare after reordering using merge again, but that is effectively just undoing what I just did.

I would appreciate any suggestions for a better approach.

Romalpa Akzo · Accepted Answer

a variable that indicates if Country2 is in the list of Country1 if valC1_g2.

This description of yours is not really clear and might lead to the different understandings:

Some could be interpreted it as Country1 in the corresponding observation (pair) has valC1_g2=1 (like in Pearly Spencer's code, which is elegant and "enjoyable", if you might be familiar with mata).
Meanwhile, others might think it as Country2 in the relevant observation (pair) is the same with a Country1 somewhere, and that Country1 has its (max) valC1_g2=1. Notice that (max) is mentioned here since with your small sample, it could not be sure with the assumption that any Country1 should have identical valC1_g2 in any observations.

For me, your trying code implies the second understanding. And if it is such case, adding 1 more line into your code could help you out. Actually, there are some other (and might be more elegant) ways to do, but this one might save your time of following it up.

preserve
tempfile localdata
keep Country1 valC1_g2

*(Addedline)
collapse (max) valC1_g2 if valC1_g2, by(Country1)

rename (Country1 valC1_g2) (Country2 valC2_g2)
save `localdata'
restore
sort Country2

* (It is better to use the new merge syntax, despite your old one might work)
merge m:1 Country2 using `localdata', keep(master match) nogen

EDIT. Your edited version, which confirms the second understanding, has made it a little bit clearer. One more suggestion to improve the transparency is that the "Afterward part" (comparing 2 Countries ID) should not be put in the same line with "Second part" (checking the logic of the code for the first part). They are 2 different issues.

Below code, which still utilizes -merge- as the main tool, is expected to serve for your issues, by:

Creating a variable (C2inListC1) to check whether a Country2 is in the list of Country1
If a Country2 is in the List of Country1, its valC2_g2 would have the (max) value from valC1_g2 of the corresponding Country1. (max) is mentioned because, as above discussed, so far, it could not be sure that each Country1 should have identical valC1_g2.
Creating a variable (C1inListC2) to check whether a Country1 is in the list of Country2. Then you could use information of C2inListC1 and C1inListC2 to check whether Country1 and Country2 have the same value sets. Both are the same only when "all Country2 are in the List of Country1" and also "all Country1 are in the List of Country2"

Notice that although there are some other ways (which might be more concise in coding or more elegant, especially for creating C2inListC1), -merge- provides a very clear and straight logic and seems to be the best tool to collect the valC1_g2 information. It should work, merely upon the correct logic. The coding just serves no more than the translation of the logic itself to the coding language.

tempfile ListC1 ListC2

preserve
collapse (max) valC1_g2, by(Country1)
rename (Country1 valC1_g2) (Country2 valC2_g2)
save `ListC1', replace
restore

preserve
contract Country2
ren Country2 Country1
save `ListC2', replace
restore

merge m:1 Country2 using `ListC1', keep(master match) gen(C2inListC1)
merge m:1 Country1 using `ListC2', keep(master match) gen(C1inListC2) keepusing(Country1)

recode C2inListC1 C1inListC2 (3=1) (else=0)
label values C2inListC1 C1inListC2 none

count if C2inListC1==0 | C1inListC2==0
if r(N) == 0{
di "Country1 and Country2 have the same values set"
}
else {
di in red "Country1 and Country2 do NOT have the same values set"
}

Check if each value of one variable matches any value of another variable

Answers (1)

Related Questions