Reputation: 73
Say I have a R data frame with 5 columns as shown below
time MeanVar1 SdVar1 MedianVar1 MeanVar2 SdVar2
1 -0.8453978 -1.636985 -0.6239832 -0.4366982 -1.7037374
2 -0.3000778 -1.034199 0.3292459 -0.6606399 -0.1525361
Is there a neat way to make the dataFrame as follows:
Var time Mean/Median SD
1 1 -0.8453978 -1.636985
1 2 -0.3000778 -1.034199
1 1 -0.6239832 N/A
1 2 0.3292459 N/A
2 1 -0.4366982 -1.7037374
2 2 -0.6606399 -0.1525361
or
Var time Mean/Median SD
MeanVar1 1 -0.8453978 -1.636985
MeanVar1 2 -0.3000778 -1.034199
MeanVar1 1 -0.6239832 N/A
MeanVar1 2 0.3292459 N/A
MeanVar2 1 -0.4366982 -1.7037374
MeanVar2 2 -0.6606399 -0.1525361
My overall intention is to plot the Mean,SD for Variable 1 with errorbars, Median of Variable1 and Mean,SD for Variable 1 in a same plot. Hence, I feel that if I modify the data in such a format, I can just plot it at once rather than plotting each line separately.
With my limited knowledge of Reshape and melt, I couldn't do this.
EDIT : Adding more information
Sample Input (Given 3 rows, there are 100 rows in total):
Label trainingSize Accuracy_Mean Accuracy_SD Accuracy_SE Precision_Mean Recall_Mean F1 Accuracy_Median PriorClass0_Mean PriorClass0_SD PriorClass0_SE ProbabilityEstimate_0given0_Mean ProbabilityEstimate_0given0_SD ProbabilityEstimate_0given0_SE ProbabilityEstimate_0given1_Mean ProbabilityEstimate_0given1_SD ProbabilityEstimate_0given1_SE
0perc_0repeat 0.4 0.5506 0.0531 0.0038 0.6374 0.2336 0.3419 0.5372 0.5278 0.0254 0.0018 0.6433 0.0028 0.0 0.4169 0.003 0.0
0perc_0repeat 0.4 0.5456 0.0482 0.0034 0.6465 0.2142 0.3218 0.5333 0.5304 0.0248 0.0018 0.6414 0.0028 0.0 0.4193 0.0027 0.0
0perc_0repeat 0.4 0.5574 0.0555 0.0039 0.6604 0.2197 0.3297 0.5404 0.529 0.0233 0.0016 0.6436 0.003 0.0 0.4163 0.0029 0.0
I am trying to plot the
1) the iteration number(1:100) in X Axis and the points of 5 columns (Accuracy_Mean, Accuracy_Median, PriorClass0_Mean, ProbabilityEstimate_0given0_Mean, ProbabilityEstimate_0given1_Mean in the Y AXIS.
2) distribution (density obtained by 100 points) of 5 columns with error bars (either SD or SE) in a single plot using ggplot.
I have 4 columns Precision_Mean,Recall_Mean,F1,Accuracy_Median not following the mean,sd,se pattern!
EDIT1: 1)
dput(droplevels(head(data, 3))) structure(list(Label = structure(c(1L, 1L, 1L), .Label = "0perc_0repeat", class = "factor"), trainingSize = c(0.4, 0.4, 0.4), Accuracy_Mean = c(0.5506, 0.5456, 0.5574), Accuracy_SD = c(0.0531, 0.0482, 0.0555), Accuracy_SE = c(0.0038, 0.0034, 0.0039), Precision_Mean = c(0.6374, 0.6465, 0.6604), Recall_Mean = c(0.2336, 0.2142, 0.2197), F1 = c(0.3419, 0.3218, 0.3297), Accuracy_Median = c(0.5372, 0.5333, 0.5404), PriorClass0_Mean = c(0.5278, 0.5304, 0.529 ), PriorClass0_SD = c(0.0254, 0.0248, 0.0233), PriorClass0_SE = c(0.0018, 0.0018, 0.0016), ProbabilityEstimate_0given0_Mean = c(0.6433, 0.6414, 0.6436), ProbabilityEstimate_0given0_SD = c(0.0028, 0.0028, 0.003), ProbabilityEstimate_0given0_SE = c(0, 0, 0), ProbabilityEstimate_0given1_Mean = c(0.4169, 0.4193, 0.4163), ProbabilityEstimate_0given1_SD = c(0.003, 0.0027, 0.0029), ProbabilityEstimate_0given1_SE = c(0, 0, 0)), .Names = c("Label", "trainingSize", "Accuracy_Mean", "Accuracy_SD", "Accuracy_SE", "Precision_Mean", "Recall_Mean", "F1", "Accuracy_Median", "PriorClass0_Mean", "PriorClass0_SD", "PriorClass0_SE", "ProbabilityEstimate_0given0_Mean", "ProbabilityEstimate_0given0_SD", "ProbabilityEstimate_0given0_SE", "ProbabilityEstimate_0given1_Mean", "ProbabilityEstimate_0given1_SD", "ProbabilityEstimate_0given1_SE"), row.names = c(NA, 3L), class = "data.frame")
2) Expected Output is something like:
Vars Label trainingSize Mean SD SE
Vars : Mean, PriorClass0, ProbabilityEstimate_0given0, ProbabilityEstimate_0given1; (Median, Precision, Recall, F1 not required or they can be fit to the above table with SD,SE as N/A or 0).
Upvotes: 1
Views: 240
Reputation: 193507
merged.stack
from my "splitstackshape" package handles this to a certain extent, but it recycles the values from your "SdVar" columns (so I don't get the NA
values you show in your desired output).
Nevertheless, it might be a start to solving your problem, so here's the approach:
library(splitstackshape)
merged.stack(mydf, var.stubs = c("MeanVar|MedianVar", "SdVar"), sep = "var.stubs")
# time .time_1 MeanVar|MedianVar SdVar
# 1: 1 1 -0.8453978 -1.6369850
# 2: 1 1 -0.6239832 -1.6369850
# 3: 1 2 -0.4366982 -1.7037374
# 4: 2 1 -0.3000778 -1.0341990
# 5: 2 1 0.3292459 -1.0341990
# 6: 2 2 -0.6606399 -0.1525361
If you really want those NA
values, perhaps this would do the trick:
merged.stack(
mydf, var.stubs = c("MeanVar|MedianVar", "SdVar"),
sep = "var.stubs")[, SdVar := ifelse(
duplicated(SdVar), NA, SdVar), by = time][]
# time .time_1 MeanVar|MedianVar SdVar
# 1: 1 1 -0.8453978 -1.6369850
# 2: 1 1 -0.6239832 NA
# 3: 1 2 -0.4366982 -1.7037374
# 4: 2 1 -0.3000778 -1.0341990
# 5: 2 1 0.3292459 NA
# 6: 2 2 -0.6606399 -0.1525361
Upvotes: 3