Slavatron

Reputation: 2358

R - Control Histogram Y-axis Limits by second-tallest peak

I've written an R script that loops through a data.frame and makes multiple complex plots, each of which includes a histogram. The problem is that the histograms often show a tall, uninformative peak at x=0 or x=1 that obscures the rest of the data, which is more informative. I have figured out that I can hide the tall peak by defining the x- and y-axis limits of each histogram, as seen in the code below. What I really need to figure out is how to define the y-axis limits so that they are optimized for the second-largest peak in each histogram.

Here's some code that simulates my data and plots histograms with different sorts of axis limits imposed:

library(ggplot2)
set.seed(5)

# 10 columns (X1..X10) of 100 draws from 1:10, heavily weighted towards 1
df = data.frame(matrix(sample(c(1:10), 1000, replace = TRUE, prob = c(0.8,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01)), nrow=100))

cols = names(df)
for (i in seq_along(cols)) {
  my_col = cols[i]
  # unrestricted histogram: the peak at x=1 dominates
  p1 = ggplot(df, aes_string(my_col)) + geom_histogram(bins = 10)
  print(p1)
  # restrict the x-axis only
  p2 = p1 + ggtitle(paste("Fixed X Limits", my_col)) + scale_x_continuous(limits = c(1,10))
  print(p2)
  # restrict the y-axis only; bars taller than the upper limit are dropped, not clipped
  p3 = p1 + ggtitle(paste("Fixed Y Limits", my_col)) + scale_y_continuous(limits = c(0,3))
  print(p3)
  # restrict both axes
  p4 = p1 + ggtitle(paste("Fixed X & Y Limits", my_col)) + scale_y_continuous(limits = c(0,3)) + scale_x_continuous(limits = c(1,10))
  print(p4)
}

With this simulated data, I can hard-code y-limits and reasonably expect them to work well for all the histograms. With my real data, however, the size of the peaks varies wildly between the numerous histograms I am producing. I've tried defining the y-limit with various equations based on descriptive statistics like the mean, median and range, but nothing I've come up with works well for all cases.

If I could define the y-limit in relation to the second-tallest peak of the histogram, I would have something that was perfectly suited for each situation.

Upvotes: 1

Views: 445

Answers (2)

gtwebb

Reputation: 3011

I would process the data to determine the height you need.

Something along the lines of:

sort(table(cut(df$X1, breaks = 10)), decreasing = TRUE)[2]

Working from the inside out:

cut bins the data (not really needed with integer data like you have, but probably needed with real data).

table then creates a table with the count in each of those bins.

sort sorts the table from highest to lowest.

[2] takes the second-highest value.
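The steps above could be wrapped in a small helper so that the same calculation can be reused for every column in the loop. This is only a sketch: the name second_peak, the default of 10 bins and the 10% headroom are assumptions for illustration, and the bins produced by cut are not guaranteed to line up exactly with the bins geom_histogram draws for non-integer data.

```r
# Hypothetical helper: height of the second-tallest bin of a numeric vector.
second_peak <- function(x, bins = 10) {
  counts <- table(cut(x, breaks = bins))                # count per bin
  sorted <- sort(as.vector(counts), decreasing = TRUE)  # tallest first
  sorted[2]                                             # second-tallest bin
}

# Simulated column in the style of the question's data
set.seed(5)
x <- sample(1:10, 100, replace = TRUE, prob = c(0.8, rep(0.01, 9)))

# Upper y-limit with 10% headroom (an arbitrary choice)
y_max <- second_peak(x) * 1.1
```

Inside the question's loop, the value could then replace the hard-coded 3, for example scale_y_continuous(limits = c(0, second_peak(df[[my_col]]) * 1.1)).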

Upvotes: 2

lmo

Reputation: 38510

I am not sure how ggplot builds its histograms, but one method would be to grab the results from hist:

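# tallest bin density in each column (plot = FALSE skips drawing the base histogram)
maxDensities <- sapply(df, function(i) max(hist(i, plot = FALSE)$density))
# take the second highest of those per-column peaks:
myYlim <- sort(maxDensities, decreasing = TRUE)[2]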
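If the uncertainty about how ggplot builds its histograms is a concern, one option is to ask ggplot2 for the bar heights it actually computed, using layer_data(), and take the second-largest count for each plot. The following is a minimal sketch under that assumption; the helper name second_tallest_bar and the 10% headroom are made up for illustration.

```r
library(ggplot2)

set.seed(5)
df <- data.frame(matrix(
  sample(1:10, 1000, replace = TRUE, prob = c(0.8, rep(0.01, 9))),
  nrow = 100
))

# Hypothetical helper: second-largest bar height as computed by ggplot2
# for the first layer of plot p.
second_tallest_bar <- function(p) {
  counts <- layer_data(p, 1)$count
  sort(counts, decreasing = TRUE)[2]
}

# One histogram per column, each zoomed to its own second-tallest bar
plots <- lapply(names(df), function(col) {
  p <- ggplot(df, aes(.data[[col]])) + geom_histogram(bins = 10)
  y_max <- second_tallest_bar(p)
  # coord_cartesian() zooms the view; the tall bar is kept and simply
  # runs off the top of the panel.
  p + coord_cartesian(ylim = c(0, y_max * 1.1)) + ggtitle(col)
})
```

Calling print(plots[[1]]) draws the first histogram. Note the design choice: coord_cartesian(ylim = ...) keeps the tall bar but cuts it off at the top of the panel, whereas scale_y_continuous(limits = ...) removes any bar taller than the limit, so pick whichever matches the effect you want.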

Upvotes: 2
