Karl Wolfschtagg

Reputation: 567

Problems understanding log-log ggplots

I'm working with a very large data set (too large to post here) and I'm really struggling with creating a histogram that looks right. This was my best try with the original data:

g <- ggplot(df2, aes(x = n))
g <- g + geom_histogram(color = "white", fill = "firebrick3", bins = 47)
g <- g + scale_x_continuous(trans = 'log10', 
        breaks = trans_breaks('log10', function(x) 10^x), 
        labels = trans_format(math_format(10^.x)))
g <- g + scale_y_continuous(trans = 'log10',
        breaks = trans_breaks('log10', function(x) 10^x), 
        labels = trans_format(math_format(10^.x)), 
        oob = squish_infinite)
g <- g + annotation_logticks()
g <- g + labs(x = "n", y = "log(Count)")
g

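This did not produce a plot; instead, it threw one error and two warnings:

  • Error in x * scale : non-numeric argument to binary operator
  • Transformation introduced infinite values in continuous y-axis
  • Removed 2 rows containing missing values (geom_bar)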

To give others something reproducible, I counted the number of times each n appeared (effectively making the histogram by hand). Here is that data:

n <- c(2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 
70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900, 
1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 
10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 
90000, 100000, 200000)
counts <- c(885452, 468462, 222097, 166234, 103348, 85845, 
60798, 52651, 231830, 81138, 41333, 25274, 17192, 12465, 
9622, 7371, 6069, 27160, 9009, 4465, 2753, 1664, 1285, 918, 
716, 568, 2400, 707, 362, 180, 106, 90, 55, 55, 39, 124, 
25, 12, 8, 2, 1, 0, 2, 0, 3, 2)

These were constructed as half-open intervals, [a, b); e.g., the count corresponding to n = 30 covers every n from 30 through 39 (30, 31, 32, 33, 34, 35, 36, 37, 38, 39).
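As a minimal sketch of the [a, b) binning described above: base R's `findInterval()` assigns each value to the bin whose left edge it meets or exceeds. The sample values and the names `x`, `bin_left`, and `bin_counts` are made up for illustration, and the breaks cover only the first part of the series above.

```r
# Left edges of the first bins: 2..9, then 10, 20, ..., 100
breaks <- c(2:9, seq(10, 100, by = 10))

# Hypothetical sample values, not taken from the real data set
x <- c(30, 31, 39, 40, 9, 10)

# findInterval() uses left-closed, right-open intervals [a, b) by default,
# so 30, 31 and 39 all land in the bin that starts at 30
bin_left <- breaks[findInterval(x, breaks)]

# Count how many values fall into each bin, as in the hand-built histogram
bin_counts <- table(bin_left)
print(bin_counts)
```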

The final histogram should:

I think I've missed something fundamental - any ideas?

Updating things with the most recent suggestions, the code is:

g <- ggplot(bigram2, aes(x = n))
g <- g + geom_histogram(color = "white", fill = "firebrick3", bins = 47)
g <- g + scale_x_continuous(trans = 'log10',
        breaks = trans_breaks('log10', function(x) 10^x), 
        labels = trans_format('log10', math_format(10^.x)))
g <- g + scale_y_continuous(trans = 'log10',
        breaks = trans_breaks('log10', function(x) 10^x), 
        labels = trans_format('log10', math_format(10^.x)))
g <- g + annotation_logticks()
g <- g + labs(x = "n", y = "log(Count)")
g

and the resulting plot looks like this: [image not included]

Upvotes: 0

Views: 425

Answers (1)

chemdork123

Reputation: 13833

OP, you're on the right track here. Ultimately, the issue comes down to a typo :/. I'll explain the 3 messages you received when trying your original code, then show you an example with dummy data that should be applicable to your dataset.

Your error messages.

OP references three messages received when running the code. Let's explain them (out of sequence):

  • Removed 2 rows containing missing values (geom_bar). This is a warning, not an error. It just tells you that two bins have no value to draw, so nothing is plotted for them. You can safely ignore it.

  • Transformation introduced infinite values in continuous y-axis. This is also a warning and can be safely ignored. Infinite values are expected on a log-transformed y-axis whenever some bins have 0 counts, because log10(0) evaluates to -Inf. The plot can still be made; these bins are most likely the ones that get "removed". In your case, OP, you probably have a histogram with two of the bins in the sequence removed... because they contain nothing. No worries here.

  • Error in x * scale : non-numeric argument to binary operator. This one pops up because you effectively have a typo in your reference to trans_format() in the scale_*_continuous() functions. The function expects a trans= argument first (much like trans_breaks()), but you only specify the format via math_format(). When math_format() is applied to the trans= argument in trans_format()... you get that error.
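The -Inf behind the second warning can be checked directly in base R. This is only a small sketch with made-up bin counts (`bin_counts`, `log_counts`, and `dropped` are illustrative names, not part of OP's data):

```r
# Hypothetical bin counts, including one empty bin
bin_counts <- c(100, 10, 1, 0)

# log10 of a zero count is -Inf, which has no position on a log axis
log_counts <- log10(bin_counts)
print(log_counts)

# Flag the bins whose transformed value is infinite
dropped <- is.infinite(log_counts)
print(dropped)
```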

Fixing the error message

The fix is pretty simple: specify "log10" as the first argument of trans_format(). In other words, use this: scale_*_continuous(..., labels = trans_format("log10", math_format(10^.x))), and not this: scale_*_continuous(..., labels = trans_format(math_format(10^.x)))
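To see what the corrected call produces, the labeller can be built and called on its own, outside of any plot. This is a small sketch that assumes the scales package is installed; the names `label_fun` and `labs_out` are just for illustration.

```r
library(scales)

# Build the labeller the same way the corrected scale does:
# the first argument is the transformation, the second is the formatter
label_fun <- trans_format("log10", math_format(10^.x))

# Apply it to a few break values; each one becomes a plotmath expression
labs_out <- label_fun(c(1, 10, 100))
print(labs_out)
```

Calling the labeller by hand like this is a quick way to confirm the arguments are in the right order before adding it to a scale.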

I'll show this via a dummy dataset:

library(ggplot2)
library(scales)  # trans_breaks(), trans_format(), math_format()
set.seed(1234)
d <- data.frame(n = sample(1:10000, size = 1000000, replace = TRUE))

Here's a histogram without the log transformations:

p <- ggplot(d, aes(x=n)) + geom_histogram(bins=30, color='black', fill='steelblue')
p

[Image: histogram of the dummy data without the log transformations]

And the log-log transformation:

p +
  scale_x_continuous(
    trans='log10',
    breaks = trans_breaks('log10', function(x) 10^x), 
    labels = trans_format('log10', math_format(10^.x))) +
  scale_y_continuous(
    trans='log10',
    breaks = trans_breaks('log10', function(x) 10^x), 
    labels = trans_format('log10', math_format(10^.x))
    )

[Image: histogram of the dummy data with log-log axes]

Upvotes: 2
