Sun
Sun

Reputation: 179

rowsum based on groupings or conditions in r

I want to do in based on column names.

I have more than 50 columns and have looked at various solutions, including this.

However, this doesn't really answer my question. I have column names such as: total_2012Q1, total_2012Q2, total_2012Q3, total_2012Q4 ,..., up to total_2014Q4, and other character variables. I want to add rows by year, so in the end, I would have three year columns: total_2012, total_2013, total_2014.

I don't want to and select something like ..sample[,2:5]. Is there a way I can sum them without manually going through column numbers? Also, is an option but if there are character variables as well, how do you deal only the int variables you want to sum up?

simple reproducible example (pre):

id total_2012Q1 total_2012Q2 total_2013Q1 total_2013Q2 char1 char2
 1         1231         5455         1534         2436    N     Y
 2         3948         1239          223          994    Y     N

reproducible example (post):

id total_2012 total_2013 char1 char2
 1       6686      3970     N     Y
 2       5187      1217     Y     N

Thanks for any suggestions.

Upvotes: 0

Views: 553

Answers (2)

camille
camille

Reputation: 16862

The way I would approach this with tidyverse functions is to reshape the data into a long format so you have a column of total_2012Q1, total_2012Q2, etc. You can then separate that into years and quarters, where quarters are marked as being the last two characters in each string:

library(dplyr)
library(tidyr)

df %>%
  gather(key, value, starts_with("total")) %>%
  separate(key, into = c("year", "quarter"), sep = -2)
#> # A tibble: 8 x 6
#>      id char1 char2 year       quarter value
#>   <dbl> <chr> <chr> <chr>      <chr>   <dbl>
#> 1     1 N     Y     total_2012 Q1       1231
#> 2     2 Y     N     total_2012 Q1       3948
#> 3     1 N     Y     total_2012 Q2       5455
#> 4     2 Y     N     total_2012 Q2       1239
#> 5     1 N     Y     total_2013 Q1       1534
#> 6     2 Y     N     total_2013 Q1        223
#> 7     1 N     Y     total_2013 Q2       2436
#> 8     2 Y     N     total_2013 Q2        994

After that, you can group by your identifiers and year, sum up the values, and reshape it back to a wide format.

df %>%
  gather(key, value, starts_with("total")) %>%
  separate(key, into = c("year", "quarter"), sep = -2) %>%
  group_by_at(vars(id:year)) %>%
  summarise(value = sum(value)) %>%
  spread(key = year, value = value)
#> # A tibble: 2 x 5
#> # Groups:   id, char1, char2 [2]
#>      id char1 char2 total_2012 total_2013
#>   <dbl> <chr> <chr>      <dbl>      <dbl>
#> 1     1 N     Y           6686       3970
#> 2     2 Y     N           5187       1217

An approach like this, particularly gathering with starts_with("total") rather than hard-coded column names or column positions, lets you scale to larger datasets with many more columns.

Upvotes: 1

Sotos
Sotos

Reputation: 51592

You can use split.default, i.e.

sapply(split.default(df, sub('^.*_([0-9]+)Q[0-9]', '\\1', names(df))), rowSums)
#     2012 2013
#[1,]    3   23
#[2,]    7   37
#[3,]    9   49

DATA:

dput(df)
structure(list(total_2012Q1 = c(1, 2, 3), total_2012Q2 = c(2, 
5, 6), total_2013Q1 = c(12, 15, 16), total_2013Q2 = c(11, 22, 
33)), class = "data.frame", row.names = c(NA, -3L))

Upvotes: 5

Related Questions