Reputation: 1247
This is my dataframe
year state sex age population deaths
-----------------------------------------------
1970 Cal Male 12 134 45
1970 Cal Female 12 100 20
1980 Pen Male 13 200 10
1980 Pen Female 13 150 50
What I want to do is add the number of deaths and population for each year, and state based on the column sex. And create a new row called Total, with the sum of these values.
I would like to have this:
For each year, example (1970,1980,1990...2050) I have a column called age from 0 to 100. With the number of deaths and population for Male and Female. I want to add the number of population and death for each year,state and age. (keeping the value of age)
I mean I want a third row with the add of Femlaes and Males (Total) and keep the age and year.
year state sex age population deaths
-----------------------------------------------
1970 Cal Male 12 134 45
1970 Cal Female 12 100 20
1970 Cal Total 12 234 65
1980 Pen Male 13 200 10
1980 Pen Female 13 150 50
1980 Pen Total 13 350 60
I tried with this line
df_1 <- setDT(df)[,rbind(.SD,c(.(sex = "Total"),colSums(.SD[,-1]))),state,.SDcols = c("sex", "population", "deaths", "year", "age")]
But is not giving me the right result. How can I do this?
Upvotes: 1
Views: 308
Reputation: 634
Generally the pattern is to store summary variables like this in a new table, which maintains the link between the rows of your data frame and the observations of your original data (see the Tidy Data chapter of R for Data Science). You can do this with a combination of dplyr's group_by()
and summarise()
functions:
library("dplyr")
df <- readr::read_table("
year state sex age population deaths
1970 Cal Male 12 134 45
1970 Cal Female 12 100 20
1980 Pen Male 13 200 10
1980 Pen Female 13 150 50
")
df %>%
group_by(year, state, age) %>%
summarise(population = sum(population),
deaths = sum(deaths),
.groups = "drop") ->
df_total
df_total
#> # A tibble: 2 x 5
#> year state age population deaths
#> <dbl> <chr> <dbl> <dbl> <dbl>
#> 1 1970 Cal 12 234 65
#> 2 1980 Pen 13 350 60
If you really want this in the original data table, you can simply use mutate()
to add a sex
column and bind_rows()
the two back together:
df_total <- mutate(df_total, sex = "Total")
bind_rows(df, df_total) %>%
arrange(year, state, sex)
#> # A tibble: 6 x 6
#> year state sex age population deaths
#> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 1970 Cal Female 12 100 20
#> 2 1970 Cal Male 12 134 45
#> 3 1970 Cal Total 12 234 65
#> 4 1980 Pen Female 13 150 50
#> 5 1980 Pen Male 13 200 10
#> 6 1980 Pen Total 13 350 60
Upvotes: 1
Reputation: 18742
library(purrr)
library(janitor)
library(tidyr)
library(dplyr)
df %>%
dplyr::select(sex, everything()) %>%
dplyr::group_split(year, state, age) %>%
purrr::map_dfr(janitor::adorn_totals) %>%
dplyr::mutate(across(c(year, state, age), ~ ifelse(sex == "Total", NA, .))) %>%
tidyr::fill(everything(), .direction = "down") %>%
dplyr::select(year, state, everything())
FYI, across
is from the dplyr
library.
Output
year state sex age population deaths
1970 Cal Male 12 134 45
1970 Cal Female 12 100 20
1970 Cal Total 12 234 65
1980 Pen Male 13 200 10
1980 Pen Female 13 150 50
1980 Pen Total 13 350 60
How it works
select
statement is to accommodate the behavior of janitor::adorn_totals
, which puts the "Total" label in the first column. So this puts sex
as the first column.group_split
splits the dataframe into a list of dataframes by same year
, state
, and age
.purrr::map_dfr
will map janitor::adorn_totals
over each dataframe in the list, which essentially is creating subtotals, and then row binds back into one dataframe, hence the r
in dfr
.year
, state
, and age
columns will have the same value in the "Total" row. The last select
reorders everything back to the original order.Alternatively, since janitor::adorn_totals
is a bit buggy you can do this using some pivots to create the exact same output:
library(tidyr)
library(dplyr)
df %>%
tidyr::pivot_wider(id_cols = c(year, state, age),
names_from = sex,
values_from = population:deaths) %>%
dplyr::mutate(population_Total = rowSums(across(starts_with("population"))),
deaths_Total = rowSums(across(starts_with("deaths")))) %>%
tidyr::pivot_longer(cols = contains("_"),
names_to = c(".value", "sex"),
names_pattern = c("(.*)_(.*)$"))
FYI, across
, contains
, and starts_with
are exported or from the dplyr
library.
How it works
pivot_wider
re-orients the data so there is one row per year
, state
, and age
. New columns are created appending sex
to the end of the pivoted columns, which are population
and deaths
in this case. This creates columns population_Male
, population_Female
, deaths_Male
, and so on.tidyselect
syntax to sum all the rows starting with "population" and likewise for "deaths", storing them in appropriately named columns appended with "_Total".year
, state
, and age
as the ID columns. It pivots all columns with an underscore: population_Male
, population_Female
, population_Total
, etc. Using regular expressions the column names are parsed by the underscore: population_Total
is parsed into the strings population
and Total
. The former becomes a column name storing values, the latter is a value stored in a column called "sex".Upvotes: 1