Is there a way to calculate percentages comparing observations?

Question

My data set looks like this in Stata

YEAR	Gender	Presidents
2023	Male	5
2023	Female	6
2023	Total	11
2022	Male	3
2022	Female	2
2022	Total	5

I want to be able to create a new variable/column that gives me the percentages for that year by taking the fraction of male/total and female/total for that particular year. In other words, I want my dataset to have an additional column that looks like this:

YEAR	Gender	Presidents	Perc_pres
2023	Male	5	0.45
2023	Female	6	0.55
2023	Total	11	1
2022	Male	3	0.6
2022	Female	2	0.4
2022	Total	5	1

I tried the egen perc_pres_`gender' = total(cond(gender==Female), 1, .) command to create new 2 new columns for female total and male total and then took percentages but that seems too long a way to get what I want. I just want one column to give me the percentages for both men and women and no additional columns.

Nick Cox · Accepted Answer

Generally, a data layout such as yours is an example of spreadsheet thinking that doesn't match standard good practice in statistical software. Totals based on some observations (rows) should not be held in other observations in the same dataset, but in other variables.

Specifically, your expression Gender == Female at best makes sense if Female is another variable in the dataset. It would be illegal for either of the two other (and more obvious) possibilities, that Gender is a string variable, in which case "Female" is needed, or that Gender is a numeric variable with value labels, in which case a different comparison is needed.

Your apparent use of a local macro gender also doesn't make sense to me.

You are confusing percents (which add to 100) with proportions or fractions (which add to 1).

The code needed here is awkward largely because your data layout is awkward.

Here is one solution, which hinges on the fact that Female Male Total sort in that order.

* Example generated by -dataex-. For more info, type help dataex
clear
input int year str6 gender byte presidents
2023 "Male"    5
2023 "Female"  6
2023 "Total"  11
2022 "Male"    3
2022 "Female"  2
2022 "Total"   5
end

bysort year (gender) : gen percent = 100 * presidents / presidents[_N]

list , sepby(year)

     +-------------------------------------+
     | year   gender   presid~s    percent |
     |-------------------------------------|
  1. | 2022   Female          2         40 |
  2. | 2022     Male          3         60 |
  3. | 2022    Total          5        100 |
     |-------------------------------------|
  4. | 2023   Female          6   54.54546 |
  5. | 2023     Male          5   45.45454 |
  6. | 2023    Total         11        100 |
     +-------------------------------------+

If that doesn't help (enough), please use dataex to give an example (and see the stata tag wiki).

Is there a way to calculate percentages comparing observations?

Answers (1)

Related Questions