Reputation: 1484
I have a question about data.table
's melt
and dcast
for multiple columns. I browsed on StackOverFlow but many similar posts are not what I want. I will explain below.
First, data
is about causes of problems and the value amount. Here is my part of data
:
ID Type Problem1 Value1 Problem2 Value2 Problem3 Value3
1 A X 500 Y 1000 Z 400
2 A X 600 Z 700
3 B Y 700 Z 100
4 B W 200 V 200
5 C Z 500 V 500
6 C X 1000 W 100 V 900
Second, ID
is unique. Type
contains three(A
, B
, and C
). There are 5 problems.
Take ID == 1
as an example. It is Type A
and it contains 3 problems(X
, Y
, and Z
). Its Problem X
has Value 500
, Problem Y
has Value 1000
, Problem Z
has Value 400
. Take ID == 5
as an example. It is Type C
and contains 2 problems(Z
and V
). Its Problem Z
has Value 500
and Problem V
has Value 500
.
Third, column ID
, Type
, Problem1
, Problem2
, and Problem3
are character
. Value1
, Value2
, and Value3
are numeric
.
The result what I want is:
Type X Y Z W V
A 1100 1000 1100 0 0
B 0 700 100 200 200
C 1000 0 500 100 1400
I don't know how to explain here properly. I want to group the Type
and then summation each problem's vlaue. I think this is about long to wide. I found reference here and here. Second one may be useful. However, I have no idea where to begin. Any suggestions?
# data
dt <- fread("
ID Type Problem1 Value1 Problem2 Value2 Problem3 Value3
1 A X 500 Y 1000 Z 400
2 A X 600 Z 700
3 B Y 700 Z 100
4 B W 200 V 200
5 C Z 500 V 500
6 C X 1000 W 100 V 900", fill = T)
Upvotes: 3
Views: 1278
Reputation: 38500
Here is a method that uses the melt
function that akrun performs and then uses matrix subsetting to return the desired result.
# melt and aggregate the data
temp <- melt(dt, measure = patterns("^Value", "^Problem"),
value.name = c("Value", "Problem"))[
!is.na(Value), .(Value=sum(Value)), by=.(Type, Problem)]
# set up the storage matrix
dimNames <- list(sort(unique(temp$Type)), unique(temp$Problem))
myMat <- matrix(0, length(dimNames[[1]]), length(dimNames[[2]]), dimnames=dimNames)
# fill in the matrix with the desired values
myMat[cbind(temp$Type, temp$Problem)] <- temp$Value
This returns the matrix
myMat
X Y W Z V
A 1100 1000 0 1100 0
B 0 700 200 100 200
C 1000 0 100 500 1400
To return a data.table, you can do
data.table(myMat, keep.rownames=TRUE)
rn X Y W Z V
1: A 1100 1000 0 1100 0
2: B 0 700 200 100 200
3: C 1000 0 100 500 1400
Upvotes: 3
Reputation: 887048
We can first melt
by specifying the patterns
in measure
to 'long' format and then do the dcast
with the fun.aggregate
as sum
dcast(melt(dt, measure = patterns("^Value", "^Problem"),
value.name = c("Value", "Problem"))[Problem != ""
][, Problem := factor(Problem, levels = c("X", "Y", "Z", "W", "V"))],
Type ~Problem, value.var = "Value", sum, na.rm = TRUE)
# Type X Y Z W V
#1: A 1100 1000 1100 0 0
#2: B 0 700 100 200 200
#3: C 1000 0 500 100 1400
melt
from data.table
can take multiple patterns
in the measure
argument. So, when we say "^Value"
it matches all the columns that have names that start (^
) with "Value" and similarly for "Problem" and create two 'value' columns. In the above, we are naming those columns as 'Value' and 'Problem' with value.name
argument. As the dataset having some blanks, the long format also have the blank elements which we remove with Problem != ""
. The next step is only important if we need to have the columns in a specific order. So, we change the 'Problem' to factor
class and specified the levels
in that order. Now, the melt
part is completed. The long format is now changed to 'wide' with dcast
by specifying the formula, value.var
column and the fun.aggregate
(here it is sum
)
Upvotes: 8
Reputation: 5146
This can be done quite easily with dplyr / tidyr:
library("dplyr")
library("tidyr")
# assume x is your dataframe
bind_rows(
select(x, ID, Type, Problem = Problem1, Value = Value1),
select(x, ID, Type, Problem = Problem2, Value = Value2),
select(x, ID, Type, Problem = Problem3, Value = Value3)
) %>%
filter(!(is.na(Problem))) %>%
group_by(Type, Problem) %>%
summarise(Value = sum(Value)) %>%
spread(Problem, Value, fill = 0)
Output
# A tibble: 3 x 6
# Groups: Type [3]
Type V W X Y Z
* <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 0 0 1100 1000 1100
2 B 200 200 0 700 100
3 C 1400 100 1000 0 500
If the order of columns V - Z is important this can be easily fixed by adding a final select statement.
Upvotes: 3
Reputation: 28329
Dummy and straightforward way, but still works (hope that someone can help to improve my solution).
library(magrittr)
rbind(
dt[, .(Type, P = Problem1, V = Value1)],
dt[, .(Type, P = Problem2, V = Value2)],
dt[, .(Type, P = Problem3, V = Value3)]) %>%
.[P != ""] %>%
dcast(Type ~ P, value.var = "V", sum)
Edit Improved by following akrun's code (passing function to dcast
).
Upvotes: 3