Reputation: 3
I'm working on a research model using R that is calculating anomaly scores (RMSE values) for three distinct groups within a large dataset. The anomaly scores are a continuous data type and are quite small, ranging from approximately 1e-04 to 1e-07 across a population of approximately 1 million observations.
I have the summary and descriptive statistics for the anomaly score distribution of each group, and I can create useful histograms showing how each of the three groups is distributed across the range of scores. However, because the bin frequencies vary widely and much of the score range contains high-density peaks, I need log scales on both axes of the histogram: a log-scaled frequency count for each bin (y-axis) and log-scaled binned score values (x-axis). Without them, the distributions are hard to read.
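To make the binning idea concrete, here is a rough base-R sketch. It assumes simulated stand-in scores (`scores` is hypothetical, not my real data) spread between 1e-07 and 1e-04, and uses bin edges that are equally spaced in log10 units, so every decade of the score range gets the same number of bins:

```r
set.seed(123)

# Hypothetical stand-in for the real anomaly scores (assumption: log-uniform spread)
scores <- 10^runif(1000, min = -7, max = -4)

# 30 bins whose edges are equally spaced on the log10 scale
breaks <- 10^seq(-7, -4, length.out = 31)

# Count observations per bin without drawing anything
h <- hist(scores, breaks = breaks, plot = FALSE)

# Log-scaled counts; any empty bin would give log10(0) = -Inf
log_counts <- log10(h$counts)

head(data.frame(bin_start = head(breaks, -1),
                count = h$counts,
                log10_count = log_counts))
```

With linear bin edges, almost all 30 bins would land in the top decade of the range, which is the readability problem described above.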
Fortunately, ggplot2 is really useful for creating some very attractive dual-axis log transformed histograms.
However, I cannot figure out how to show all three groups, by color, within the same log-transformed histogram. I want it to look like the histogram below, but with a log transformation on both the x- and y-axis. The plot below shows the 3 groups in one histogram but uses the default linear scales:
For log transformed axis values, the best I can do so far is produce three separate histograms, one for each group:
Below is sample R code to illustrate my problem with a randomly-generated example dataset and the ggplot2 approaches that I have taken so far:
library(ggplot2)
library(dplyr)
library(hrbrthemes)
I created some simple random sample data to produce an example dataset. This produces an example dataframe called d, which contains a class label IV of either A, B or C for each observation. The target variable is the anomaly_score continuous value for each observation. There are 300 rows of dummy data in this dataframe.
DV_score_generator = round(runif(300,0.001,0.999), 3)
d <- data.frame( label = sample( LETTERS[1:3], 300, replace=TRUE, prob=c(0.65, 0.30, 0.05) ), anomaly_score = DV_score_generator)
First, I use ggplot to create the ordinary (linear-scale) histogram that shows all 3 groups on the same plot, by color. Please note that with this small set of randomized sample data, an x- and y-axis log transformation doesn't appear to be necessary to show the distribution patterns, but it does become an issue with the vastly larger and more complex score values in the DV of the actual data.
p <- d %>%
ggplot( aes(x=anomaly_score, fill=label)) +
geom_histogram( color="#e9ecef", alpha=0.6, position = 'identity') +
scale_fill_manual(values=c("#69b3a2", "blue", "#404080")) +
theme_ipsum() +
labs(fill="")
p
Produces this linear-scale multiclass histogram:
Create the grouping subsets.
group_a <- d[ which(d$label =='A'), ]
group_b <- d[ which(d$label =='B'), ]
group_c <- d[ which(d$label =='C'), ]
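As a side note, the same subsets could probably be built in one step with base R's `split()`. This is only a sketch on a tiny made-up data frame (`d_demo` and `groups` are illustrative names, not part of my real code):

```r
# Tiny hypothetical data frame with the same two columns as d
d_demo <- data.frame(
  label = rep(c("A", "B", "C"), times = c(6, 3, 1)),
  anomaly_score = seq(0.1, 1.0, by = 0.1)
)

# One list element per label; groups$A plays the role of group_a, and so on
groups <- split(d_demo, d_demo$label)

names(groups)
sapply(groups, nrow)
```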
Now produce a series of dual axis log-transformed histograms, producing one histogram for each distinct label class in the dataset:
# Group A, log transformed
ggplot(group_a, aes(x = anomaly_score)) +
geom_histogram(aes(y = after_stat(count)), binwidth = 0.05,
colour = "darkgoldenrod1", fill = "darkgoldenrod2") +
scale_x_continuous(name = "Log-scale Anomaly Score", trans="log2") +
scale_y_continuous(trans="log2", name="Log-transformed Frequency Counts") +
ggtitle("Transformed Anomaly Scores - Group A Only")
Group A transformed histogram:
# Group B, log transformed
ggplot(group_b, aes(x = anomaly_score)) +
geom_histogram(aes(y = after_stat(count)), binwidth = 0.05,
colour = "green", fill = "darkgreen") +
scale_x_continuous(name = "Log-scale Anomaly Score", trans="log2") +
scale_y_continuous(trans="log2", name="Log-transformed Frequency Counts") +
ggtitle("Transformed Anomaly Scores - Group B Only")
Group B transformed histogram:
# Group C, log transformed
ggplot(group_c, aes(x = anomaly_score)) +
geom_histogram(aes(y = after_stat(count)), binwidth = 0.05,
colour = "red", fill = "darkred") +
scale_x_continuous(name = "Log-scale Anomaly Score", trans="log2") +
scale_y_continuous(trans="log2", name="Log-transformed Frequency Counts") +
ggtitle("Transformed Anomaly Scores - Group C Only")
Group C transformed histogram:
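For reference, the three separate plots could likely be produced by a single call using `facet_wrap()`. The sketch below assumes a stand-in data frame (`d_demo`, hypothetical) built the same way as `d`. It still draws one panel per group, so it does not give the single overlaid histogram I am after:

```r
library(ggplot2)

# Hypothetical stand-in for the example data frame d
set.seed(42)
d_demo <- data.frame(
  label = sample(LETTERS[1:3], 300, replace = TRUE, prob = c(0.65, 0.30, 0.05)),
  anomaly_score = round(runif(300, 0.001, 0.999), 3)
)

p_facets <- ggplot(d_demo, aes(x = anomaly_score, fill = label)) +
  # binwidth is applied after the x transformation, i.e. in log2 units
  geom_histogram(binwidth = 0.05, colour = "grey30") +
  scale_x_continuous(name = "Anomaly score (log2 scale)", trans = "log2") +
  scale_y_continuous(name = "Count (log2 scale)", trans = "log2") +
  # one panel per label, stacked vertically so the x-axes line up
  facet_wrap(~ label, ncol = 1)

p_facets
```

Empty bins will still trigger a warning about infinite values on the y-axis, because their count of 0 has no finite log.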
End.
Thanks in advance, everyone!
Upvotes: 0
Views: 686
Reputation: 37933
I'm not quite sure what problem exactly you're running into. The suggested strategy seems to work just fine; example below:
library(ggplot2)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
DV_score_generator = round(runif(300,0.001,0.999), 3)
d <- data.frame( label = sample( LETTERS[1:3], 300, replace=TRUE, prob=c(0.65, 0.30, 0.05) ),
anomaly_score = DV_score_generator)
p <- d %>%
ggplot( aes(x=anomaly_score, fill=label)) +
geom_histogram( color="#e9ecef", alpha=0.6, position = 'identity') +
scale_fill_manual(values=c("#69b3a2", "blue", "#404080")) +
labs(fill="")
p +
scale_x_continuous(trans = "log2") +
scale_y_continuous(trans = "log2")
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> Warning: Transformation introduced infinite values in continuous y-axis
If you're bothered by 0s showing up as negative bars, you might use the pseudo-log transformation instead. A log transformation turns 0s into -Inf, which has a special meaning in ggplot2: the bottom-most position in a panel.
p +
scale_x_continuous(trans = "log2") +
scale_y_continuous(trans = "pseudo_log")
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Created on 2021-08-05 by the reprex package (v2.0.0)
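To see the difference between the two transformations on zero counts, here is a small sketch using the scales package directly. The `counts` vector is made up for illustration, and I'm assuming the default `sigma = 1` and natural-log base of `pseudo_log_trans()`:

```r
library(scales)

# Made-up bin counts, including an empty bin
counts <- c(0, 1, 10, 100)

# A plain log transformation sends the empty bin to -Inf
log2(counts)

# The pseudo-log transformation stays finite and maps 0 to 0
pl <- pseudo_log_trans()
pl$transform(counts)

# For large counts it behaves almost like the natural log
pl$transform(100) - log(100)
```

The trade-off is that small counts are compressed roughly linearly rather than logarithmically, so the bottom of the axis is not a true log scale.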
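A small aside: the y-axis title 'Log-transformed Frequency Counts' might be misleading for this plot. The axis labels still show the untransformed counts; only their spacing is logarithmic. If I read that title, I might assume the labels themselves are log values, so that an axis label of 2 would mean 2^2 = 4 observations.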
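A quick way to convince yourself of this is to look at where a log2 transformation places a few counts. This is a sketch using `scales::log2_trans()`, with made-up break values:

```r
library(scales)

# Made-up counts that could appear as axis labels
count_breaks <- c(1, 2, 4, 8, 16)

# Where each label is positioned after the log2 transformation
positions <- log2_trans()$transform(count_breaks)

data.frame(label_shown = count_breaks, position_on_axis = positions)
```

The labels are the counts themselves and only the positions are logs, so a title along the lines of 'Count (log2 scale)' is probably less ambiguous.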
Upvotes: 0