Stefan Hansen

Reputation: 631

Data management with several variables

Currently I am facing the following problem, which I'm working in Stata to solve. I have added the algorithm tag, because it's mainly the steps that I'm interested in rather than the Stata code.

I have some variables, say, var1 - var20 that can possibly contain a string. I am only interested in some of these strings, let us call them A,B,C,D,E,F, but other strings can occur also (all of these will be denoted X). Also I have a unique identifier ID. A part of the data could look like this:

ID  |  var1  |  var2  |  var3  |  ..  |  var20  
1   |   E    |        |        |      |    X
1   |        |   A    |        |      |    C
2   |   X    |   F    |   A    |      |   
8   |        |        |        |      |    E

Now I want to create one entry for every ID and for every occurrence of one of the strings A,B,C,D,E,F in any of the variables. The above data should end up looking like this:

ID  |  var1  |  var2  |  var3  |  ..  |  var20
1   |    E   |        |        |  ..  |       
1   |        |    A   |        |      |       
1   |        |        |        |      |    C
2   |        |    F   |        |      |
2   |        |        |    A   |      |
8   |        |        |        |      |    E

Here we ignore every string X that is NOT A,B,C,D,E or F. My attempt so far was to create a variable that, for each entry, counts the number N of occurrences of A,B,C,D,E,F. In the original data above that variable would be N=1,2,2,1. Then I expand each entry into N identical copies. This results in the data:

ID  |  var1  |  var2  |  var3  |  ..  |  var20  
1   |   E    |        |        |      |    X
1   |        |   A    |        |      |    C
1   |        |   A    |        |      |    C
2   |   X    |   F    |   A    |      |   
2   |   X    |   F    |   A    |      |   
8   |        |        |        |      |    E

My question is: how do I proceed from here? And sorry for the poor title, but I couldn't word it any more specifically.

Upvotes: 2

Views: 168

Answers (1)

Richard Herron

Reputation: 10102

Sorry, I thought the final block was your desired output (now I understand that it's what you've accomplished so far). You can get the middle block with two calls to reshape (long, then wide).

First I'll generate data to match yours.

clear
set obs 4

* ids
generate n = _n
generate id = 1 in 1/2
replace id = 2 in 3
replace id = 8 in 4

* generate your variables
forvalues i = 1/20 {
    generate var`i' = ""
}
replace var1 = "E" in 1
replace var1 = "X" in 3
replace var2 = "A" in 2
replace var2 = "F" in 3
replace var3 = "A" in 3
replace var20 = "X" in 1
replace var20 = "C" in 2
replace var20 = "E" in 4

Now the two calls to reshape.

* reshape to long, keep only desired obs, then reshape to wide
reshape long var, i(n id) string   
keep if inlist(var, "A", "B", "C", "D", "E", "F")
tempvar long_id
generate int `long_id' = _n
reshape wide var, i(`long_id') string

The first reshape converts your data from wide to long. The var specifies that the variables you want to reshape to long all start with var. The i(n id) specifies that each unique combination of n and id is a unique observation. The reshape call provides one observation for each n-id combination for each of your var1 through var20 variables, so there are now 4*20=80 observations. Then I keep only the strings that you'd like to keep with inlist().
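If you want to convince yourself of what the long layout looks like, a quick check along these lines could be run right after the first reshape and before the keep. This is only a sketch against the example data generated above:

```stata
* sketch: inspect the long data right after the first reshape
* 4 original rows times 20 var* columns should give 80 rows
assert _N == 80
* show which of the original columns (stored in _j) held any string
tabulate _j if var != ""
```

The assert stops with an error if the row count is not what the explanation above predicts, which makes it a cheap guard before filtering.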

For the second reshape call, var specifies that the values you're reshaping are in the variable var and that var is also the prefix for the new wide variables. You wanted one row per remaining letter, so I made a new index (which has no real meaning in the end) to serve as the i index for the second reshape call. If I had used n-id as the unique observation, we'd end up back where we started, but with only the good strings. The j index remains from the first reshape call (variable _j), so reshape already knows what suffix to give to each var.

These two reshape calls yield:

. list n id var1 var2 var3 var20

     +-------------------------------------+
     | n   id   var1   var2   var3   var20 |
     |-------------------------------------|
  1. | 1    1      E                       |
  2. | 2    1             A                |
  3. | 2    1                            C |
  4. | 3    2             F                |
  5. | 3    2                    A         |
     |-------------------------------------|
  6. | 4    8                            E |
     +-------------------------------------+
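To confirm that every resulting row carries exactly one of the wanted strings, a check like the following might help. It is a sketch only; n_kept is a name I made up for illustration, and the capture guard is there because some of var1-var20 may not exist after the second reshape:

```stata
* sketch: count the wanted strings in each row (n_kept is a hypothetical name)
generate byte n_kept = 0
forvalues i = 1/20 {
    * skip columns that did not survive the reshapes
    capture confirm variable var`i'
    if !_rc {
        replace n_kept = n_kept + inlist(var`i', "A", "B", "C", "D", "E", "F")
    }
}
* every row should hold exactly one wanted string
assert n_kept == 1
drop n_kept
```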

You can easily add back variables that don't survive the two reshapes.

* if you need to add back dropped variables
forvalues i = 1/20 {
    capture confirm variable var`i'
    if _rc {
        generate var`i' = ""
    }
}
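The expand-based start described in the question can probably also be finished without reshape. The sketch below assumes the original wide data (with all of var1 through var20 present) are in memory; row, nhits, k, seen and hit are names I invented for illustration. The idea is that copy number k of a row keeps only the k-th wanted string and blanks everything else:

```stata
* sketch: finish the count-and-expand approach from the question
generate long row = _n
generate byte nhits = 0
forvalues i = 1/20 {
    replace nhits = nhits + inlist(var`i', "A", "B", "C", "D", "E", "F")
}
* rows without any wanted string contribute nothing
drop if nhits == 0
* make nhits identical copies of each row and number them 1..nhits
expand nhits
bysort row: generate byte k = _n
* walk left to right; seen counts the wanted strings met so far
generate byte seen = 0
forvalues i = 1/20 {
    generate byte hit = inlist(var`i', "A", "B", "C", "D", "E", "F")
    replace seen = seen + hit
    * keep this cell only if it is the k-th wanted string of the row
    replace var`i' = "" if !(hit & (seen == k))
    drop hit
}
drop row nhits k seen
```

Compared with the two reshapes, this keeps all twenty columns in place, at the cost of a loop over the variables.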

Upvotes: 1
