Reputation: 37
I am looking to sum the values within each of the 'M' columns that have the same character in any of 'Y' columns, so if my data frame looks like this:
X M.1 M.2 M.3 Y.1 Y.2 Y.3
K3 21 6 11 L N X
K8 31 1 29 N
K2 8 0 2 L Q Z
I would like to get this output data frame:
Y M.1 M.2 M.3
L 29 6 13
N 52 7 40
Q 8 0 2
X 21 6 11
Bonus if it can include all the values in column X that include the specific character in the 'Y' column into one column, looking like this:
Y M.1 M.2 M.3 X.all
L 29 6 13 K3,K2
N 52 7 40 K3,K8
Q 8 0 2 K2
X 29 6 13 K3
So far, using aggregate() function below I can get the sum of each value in the "Y' column individually, but appreciate a better way to make an entirely new data frame with all the sums together
aggregate(cbind(df$M.1) ~ df$Y.1, data = df, sum)
Thanks so much for help with this!
Upvotes: 0
Views: 1875
Reputation: 26343
1) dplyr
Here is a dplyr
approach but starting with reshape
(which has great functionality but the way its arguments work is impossible to remember).
library(dplyr)
(df2 <- reshape(df1, varying = c("Y.1", "Y.2", "Y.3"), direction = "long") %>%
group_by(Y))
# A tibble: 9 x 7
# Groups: Y [6]
# X M.1 M.2 M.3 time Y id
#* <chr> <int> <int> <int> <dbl> <chr> <int>
#1 K3 21 6 11 1 L 1
#2 K8 31 1 29 1 N 2
#3 K2 8 0 2 1 L 3
#4 K3 21 6 11 2 N 1
#5 K8 31 1 29 2 NA 2
#6 K2 8 0 2 2 Q 3
#7 K3 21 6 11 3 X 1
#8 K8 31 1 29 3 NA 2
#9 K2 8 0 2 3 Z 3
Now we can use left_join
after we summarised X
and c("M.1", "M.2", "M.3")
separately.
left_join(
summarise_at(df2, c("M.1", "M.2", "M.3"), .funs = sum),
mmarise(df2, X = toString(unique(X)))
)
# A tibble: 6 x 5
# Y M.1 M.2 M.3 X
# <chr> <int> <int> <int> <chr>
#1 L 29 6 13 K3, K2
#2 N 52 7 40 K8, K3
#3 Q 8 0 2 K2
#4 X 21 6 11 K3
#5 Z 8 0 2 K2
#6 NA 62 2 58 K8
2) base R
Starting from df2
here is the same idea in base R
but note that the NA
s got lost.
df2 <- reshape(df1, varying = c("Y.1", "Y.2", "Y.3"), direction = "long")
merge(
aggregate(cbind(M.1, M.2, M.3) ~ Y, df2, sum),
aggregate(X ~ Y, df2, toString)
)
# Y M.1 M.2 M.3 X
#1 L 29 6 13 K3, K2
#2 N 52 7 40 K8, K3
#3 Q 8 0 2 K2
#4 X 21 6 11 K3
#5 Z 8 0 2 K2
3) data.table
library(data.table)
setDT(df1)
df2 <- melt(df1, measure.vars = patterns("Y."), value.name = "Y")
# I'm sure there must be a cleverer way than this
df2[df2[, .(X = toString(unique(X))), by = Y], lapply(.SD, sum), .SDcols = c("M.1", "M.2", "M.3"), by = Y, on = "Y"]
data
df1 <- structure(list(X = c("K3", "K8", "K2"), M.1 = c(21L, 31L, 8L),
M.2 = c(6L, 1L, 0L), M.3 = c(11L, 29L, 2L), Y.1 = c("L",
"N", "L"), Y.2 = c("N", NA, "Q"), Y.3 = c("X", NA, "Z")), .Names = c("X",
"M.1", "M.2", "M.3", "Y.1", "Y.2", "Y.3"), class = "data.frame", row.names = c(NA,
-3L))
Upvotes: 0
Reputation: 16842
If you want to use tidyverse
functions, you can do some wrangling in a few steps. I'm breaking it down to see the intermediary results.
About missing values: that's up to you. You didn't dput
data, so when I read in your data as text with readr::read_table2
, the blanks are automatically converted to NA
. Here I'm keeping those missing values.
So first, tidyr::gather
gets you a long-shaped data frame, first with Y.1
, etc in a single column:
library(dplyr)
library(tidyr)
df %>%
gather(key, value = Y, Y.1:Y.3) %>%
head()
#> # A tibble: 6 x 6
#> X M.1 M.2 M.3 key Y
#> <chr> <int> <int> <int> <chr> <chr>
#> 1 K3 21 6 11 Y.1 L
#> 2 K8 31 1 29 Y.1 N
#> 3 K2 8 0 2 Y.1 L
#> 4 K3 21 6 11 Y.2 N
#> 5 K8 31 1 29 Y.2 <NA>
#> 6 K2 8 0 2 Y.2 Q
A second gather
puts Y
s and M
s into 2 columns:
df %>%
gather(key, value = Y, Y.1:Y.3) %>%
gather(key2, value = M, M.1:M.3) %>%
head()
#> # A tibble: 6 x 5
#> X key Y key2 M
#> <chr> <chr> <chr> <chr> <int>
#> 1 K3 Y.1 L M.1 21
#> 2 K8 Y.1 N M.1 31
#> 3 K2 Y.1 L M.1 8
#> 4 K3 Y.2 N M.1 21
#> 5 K8 Y.2 <NA> M.1 31
#> 6 K2 Y.2 Q M.1 8
Then you can group, create a column with the pasted strings such as K2,K3
, and add up the numeric values. I put x.all
in the grouping so it wouldn't get dropped after summarizing.
df %>%
gather(key, value = Y, Y.1:Y.3) %>%
gather(key2, value = M, M.1:M.3) %>%
group_by(Y) %>%
mutate(x.all = sort(X) %>% unique() %>% paste(collapse = ",")) %>%
group_by(Y, key2, x.all) %>%
summarise(sum = sum(M, na.rm = T)) %>%
head()
#> # A tibble: 6 x 4
#> # Groups: Y, key2 [6]
#> Y key2 x.all sum
#> <chr> <chr> <chr> <int>
#> 1 L M.1 K2,K3 29
#> 2 L M.2 K2,K3 6
#> 3 L M.3 K2,K3 13
#> 4 N M.1 K3,K8 52
#> 5 N M.2 K3,K8 7
#> 6 N M.3 K3,K8 40
Then bring it back into a wide shape with columns for the different M
variables:
df %>%
gather(key, value = Y, Y.1:Y.3) %>%
gather(key2, value = M, M.1:M.3) %>%
group_by(Y) %>%
mutate(x.all = sort(X) %>% unique() %>% paste(collapse = ",")) %>%
group_by(Y, key2, x.all) %>%
summarise(sum = sum(M, na.rm = T)) %>%
spread(key = key2, value = sum)
#> # A tibble: 6 x 5
#> # Groups: Y [6]
#> Y x.all M.1 M.2 M.3
#> <chr> <chr> <int> <int> <int>
#> 1 L K2,K3 29 6 13
#> 2 N K3,K8 52 7 40
#> 3 Q K2 8 0 2
#> 4 X K3 21 6 11
#> 5 Z K2 8 0 2
#> 6 <NA> K8 62 2 58
Created on 2018-10-17 by the reprex package (v0.2.1)
Upvotes: 1