Damon C. Roberts
Damon C. Roberts

Reputation: 326

Create new data frame from summed column values based on conditions

I currently have a dataframe that looks like this:

census_2
# A tibble: 236,844 x 5
   STATE  RACE CENSUS2010POP POPESTIMATE2012
   <chr> <dbl>         <dbl>           <dbl>
 1 01        1         37991           37824
 2 01        1         38150           38597
 3 01        1         39738           37653
 4 01        1         39827           38113
 5 01        1         39353           39628
 6 01        1         39520           39821
 7 01        1         39813           39425
 8 01        2         39695           39636
 9 01        2         40012           39957
10 01        2         42073           39856
# … with 236,834 more rows, and 1 more variable:
#   POPESTIMATE2016 <dbl>

The State columns are going to be something I want to universally group by. I will have multiple rows representing the same value in the race column. The CENSUS2010POP column is a census population estimate for that particular racial group in that particular state.

So, what I am ultimately wanting to do, is to get a column for each racial group population estimate for the state overall. For example, say in the original dataset, state = 1 & race = 1 refers to a population in State 1 for white constituents. What I want, is to summate the observations in that state for white population estmates and turn that estimate into a column. Then for the third column in the dataset, I'd have a summation of the population estimates in that particular state for non-whites, say it is equivalent to RACE = 2 in the original dataset above. Here is a brief example:

STATE  WHITE_CENSUS10POP                                  NONWHITE_CENSUS10POP
1      = summation(nrow where state == 1 & race == 1)     = summation(nrow where state == 1 & race == 2)
2.     = summation(nrow where state == 2 & race == 1)     = summation(nrow where state == 2 & race == 2)
3
4
5
6
...
50

Upvotes: 1

Views: 148

Answers (1)

akrun
akrun

Reputation: 887961

There are multiple ways to do this. A simple option without having other package dependencies (assuming there are only two 'race') is to do a group by on 'STATE' and summarise with two columns by subsetting the 'CENSUS2010POP', based on the value of 'race' and get the sum.

library(dplyr)
df1 %>%
   group_by(STATE) %>%
   summarise(WHITE_CENSUS10POP = sum(CENSUS2010POP[race == 1]),
             NONWHITE_CENUS10POP = sum(CENSUS2010POP[race == 2]), 
            .groups = 'drop')

A more general option is to get the sum of 'CENSUS2010POP' grouped by both 'STATE', 'race' and then pivot the 'long' format data to 'wide' with pivot_wider

library(tidyr)
df1 %>%
   group_by(STATE, race) %>%
   summarise(value = sum(CENSUS2010POP), .groups = 'drop') %>%
   pivot_wider(names_from = race, values_from = value)

Upvotes: 1

Related Questions