Reputation: 537
My data currently looks like this:
Group Team Ind. 1990 1991
Group 1 Blue a 1 1
Group 1 Blue b 1 1
Group 1 Blue c 1
Group 1 Green a 1 1
Group 1 Green b 1 1
Group 1 Green c 1 1
Group 1 Orange a 1
Group 1 Orange b 1 1
Group 1 Orange c 1 1
Group 2 Black a 1 1
Group 2 Black b 1 1
Group 2 Black c 1 1
Group 3 Grey a 1
Group 3 Grey b 1
Group 3 Grey c 1
Group 3 Yellow a 1 1
Group 3 Yellow b 1 1
Group 3 Yellow c
And I want to summarize it into this type of table:
1990 1991
Group 1 No. of Teams with all ind. (i.e. a,b,AND c) 1 3
Group 1 No. of Teams missing a value in a,b,OR c 2 0
Group 2 No. of Teams with all ind. (i.e. a,b,AND c) 1 1
Group 2 No. of Teams missing a value in a,b,OR c 0 0
Group 3 No. of Teams with all ind. (i.e. a,b,AND c) 0 0
Group 3 No. of Teams missing a value in a,b,OR c 2 2
I tried reshaping my data (and having Year as one variable), and then using the egen command:
egen test=count(Team), by (Group Year Team)
But then it was only counting the total number of rows which are complete for each Year and Group. It did not considering the uniques team in any case.
Upvotes: 0
Views: 239
Reputation: 11102
Suppose in your example data, blanks are missings and the corresponding variables are numeric. Then something similar to what you want can be achieved with:
clear
set more off
input ///
group str10 team str1 ind y1990 y1991
1 Blue a 1 1
1 Blue b 1 1
1 Blue c . 1
1 Green a 1 1
1 Green b 1 1
1 Green c 1 1
1 Orange a . 1
1 Orange b 1 1
1 Orange c 1 1
2 Black a 1 1
2 Black b 1 1
2 Black c 1 1
3 Grey a 1 .
3 Grey b . 1
3 Grey c . 1
3 Yellow a 1 1
3 Yellow b 1 1
3 Yellow c . .
end
list, sepby(group team)
*----- (similar to) what you want -----
// set years
local years 1990 1991
// number of teams per -group-
bysort group (team) : egen numteam = total(team != team[_n-1])
// comply with "and" condition
foreach y of local years {
bysort group team (y`y') : gen and`y' = y`y'[1] == y`y'[_N]
}
// compute counts of "and" conditions
collapse (first) and* numteam, by(group team)
collapse (sum) and* (first) numteam, by(group)
// compute counts of "exclusive or" conditions
foreach y of local years {
gen xor`y' = numteam - and`y'
}
// print
drop numteam
list
For any year, and group team
combination, notice the strategy to check if all individuals are present: sort
the indicator variable and if the first and last observations are the same, then all individuals are present (discard for the moment the case where all values are missing). If they are not the same, meaning a 1
in the first observation and a .
in the last, there is at least one individual not present.
(I leave to you the exercise of getting that into the exact structure you originally ask for.)
But the usual recommendation is to work with data in long form, because most statistical analysis in Stata is easier this way. To do so, reshape
first, and then do the rest:
<snip>
*----- what you want -----
// reshape data
gen i = _n
reshape long y, i(i) j(year)
// number of teams per -group-
bysort group (team) : egen numteam = total(team != team[_n-1])
// comply with "and" condition
rename (ind y) (id indicat)
bysort year group team (indicat) : gen and = indicat[1] == indicat[_N]
// compute counts of "and" conditions
collapse (first) and numteam, by(year group team)
collapse (sum) and (first) numteam, by(year group)
// compute counts of "exclusive or" conditions
gen xor = numteam - and
// pretty print
drop numteam
order group
sort group year
list, sepby(group)
The first piece of code outputs:
. list
+-----------------------------------------------+
| group and1990 and1991 xor1990 xor1991 |
|-----------------------------------------------|
1. | 1 1 3 2 0 |
2. | 2 1 1 0 0 |
3. | 3 0 0 2 2 |
+-----------------------------------------------+
and the second:
. list, sepby(group)
+--------------------------+
| group year and xor |
|--------------------------|
1. | 1 1990 1 2 |
2. | 1 1991 3 0 |
|--------------------------|
3. | 2 1990 1 0 |
4. | 2 1991 1 0 |
|--------------------------|
5. | 3 1990 0 2 |
6. | 3 1991 0 2 |
+--------------------------+
Finally, watch out for teams that have all missing values for any year, e.g. team Red in 1990, below:
+--------------------------------------+
| group team ind y1990 y1991 |
|--------------------------------------|
1. | 1 Blue a 1 1 |
2. | 1 Blue b 1 1 |
3. | 1 Blue c . 1 |
|--------------------------------------|
.
.
.
|--------------------------------------|
19. | 4 Red a . . |
20. | 4 Red b . 1 |
21. | 4 Red c . 1 |
|--------------------------------------|
22. | 4 White a 1 1 |
23. | 4 White b 1 1 |
24. | 4 White c . . |
+--------------------------------------+
as these are treated as if all individuals are present.
You need to establish how the counts behave in such cases. One way to adjust is to use
bysort year group team (indicat) : gen and = indicat[1] == indicat[_N] ///
& !missing(indicat[1])
instead of the corresponding original. After sort
ing, we only need to check if the first value is non-missing in order to discard the case that all values are missing.
Running the code with the modified line, along with my original example data, plus a fourth group with a team of all missings (see above), results in:
. list, sepby(group)
+--------------------------+
| group year and xor |
|--------------------------|
1. | 1 1990 1 2 |
2. | 1 1991 3 0 |
|--------------------------|
3. | 2 1990 1 0 |
4. | 2 1991 1 0 |
|--------------------------|
5. | 3 1990 0 2 |
6. | 3 1991 0 2 |
|--------------------------|
7. | 4 1990 0 2 |
8. | 4 1991 0 2 |
+--------------------------+
That's what I expect, according to the logic I defined: if the team has all missing values, then count it with the exclusive or conditions. If you want to follow a different logic, you need to i) spell it out, and ii) translate it to code.
If, as in your comment, the indicator variables can take on any positive value, you can generalize the rule for counting and conditions. Add a fifth group to the example data:
.
.
.
|--------------------------------------|
25. | 5 Blue a 1 . |
26. | 5 Blue b 1 . |
27. | 5 Blue c 1 . |
28. | 5 Green a 2 . |
29. | 5 Green b 2 . |
30. | 5 Green c . . |
31. | 5 Orange a 3 . |
32. | 5 Orange b 2 . |
33. | 5 Orange c 55 . |
+--------------------------------------+
and modify the corresponding line to:
bysort year group team (indicat) : gen and = !missing(indicat[_N])
I only check the last value (_N
), because after sorting, if the last value is not missing, then there are no missings at all.
Check help subscripting
and help bysort
, if not familiarized with the concepts.
Upvotes: 1