Reputation: 179
I want to do rowsum in r based on column names.
I have more than 50 columns and have looked at various solutions, including this.
However, this doesn't really answer my question. I have column names such as:
total_2012Q1, total_2012Q2, total_2012Q3, total_2012Q4
,..., up to total_2014Q4
, and other character variables. I want to add rows by year, so in the end, I would have three year columns: total_2012, total_2013, total_2014
.
I don't want to rowsum and select something like ..sample[,2:5]
. Is there a way I can sum them without manually going through column numbers? Also, split.default is an option but if there are character variables as well, how do you deal only the int variables you want to sum up?
simple reproducible example (pre):
id total_2012Q1 total_2012Q2 total_2013Q1 total_2013Q2 char1 char2
1 1231 5455 1534 2436 N Y
2 3948 1239 223 994 Y N
reproducible example (post):
id total_2012 total_2013 char1 char2
1 6686 3970 N Y
2 5187 1217 Y N
Thanks for any suggestions.
Upvotes: 0
Views: 553
Reputation: 16862
The way I would approach this with tidyverse
functions is to reshape the data into a long format so you have a column of total_2012Q1
, total_2012Q2
, etc. You can then separate that into years and quarters, where quarters are marked as being the last two characters in each string:
library(dplyr)
library(tidyr)
df %>%
gather(key, value, starts_with("total")) %>%
separate(key, into = c("year", "quarter"), sep = -2)
#> # A tibble: 8 x 6
#> id char1 char2 year quarter value
#> <dbl> <chr> <chr> <chr> <chr> <dbl>
#> 1 1 N Y total_2012 Q1 1231
#> 2 2 Y N total_2012 Q1 3948
#> 3 1 N Y total_2012 Q2 5455
#> 4 2 Y N total_2012 Q2 1239
#> 5 1 N Y total_2013 Q1 1534
#> 6 2 Y N total_2013 Q1 223
#> 7 1 N Y total_2013 Q2 2436
#> 8 2 Y N total_2013 Q2 994
After that, you can group by your identifiers and year, sum up the values, and reshape it back to a wide format.
df %>%
gather(key, value, starts_with("total")) %>%
separate(key, into = c("year", "quarter"), sep = -2) %>%
group_by_at(vars(id:year)) %>%
summarise(value = sum(value)) %>%
spread(key = year, value = value)
#> # A tibble: 2 x 5
#> # Groups: id, char1, char2 [2]
#> id char1 char2 total_2012 total_2013
#> <dbl> <chr> <chr> <dbl> <dbl>
#> 1 1 N Y 6686 3970
#> 2 2 Y N 5187 1217
An approach like this, particularly gathering with starts_with("total")
rather than hard-coded column names or column positions, lets you scale to larger datasets with many more columns.
Upvotes: 1
Reputation: 51592
You can use split.default
, i.e.
sapply(split.default(df, sub('^.*_([0-9]+)Q[0-9]', '\\1', names(df))), rowSums)
# 2012 2013
#[1,] 3 23
#[2,] 7 37
#[3,] 9 49
DATA:
dput(df)
structure(list(total_2012Q1 = c(1, 2, 3), total_2012Q2 = c(2,
5, 6), total_2013Q1 = c(12, 15, 16), total_2013Q2 = c(11, 22,
33)), class = "data.frame", row.names = c(NA, -3L))
Upvotes: 5