tropicalbath

Reputation: 133

Interpreting binned scatterplot (R) and calculating variance of the mean

I am trying to plot simulation results against the samples. Because there are many data points, I opted for a binned scatterplot. One of the papers suggested using a binned plot to calculate the first-order effect by calculating the variance of the coloured points. My plot looks like this:

I used the code for this plot from here: making binned scatter plots for two variables in ggplot2 in R

However, I do not quite know how to interpret the plot. I understand that the coloured points are the averages of the bins, but what do they actually tell us about the data, and how do I then calculate the variance of these yellow points?

Can we infer from this plot that the variables show a (weak) linear relationship, even though some of the yellow points do not really follow the trend?

Thank you in advance!

Binned scatter plot

Upvotes: 0

Views: 1464

Answers (1)

maydin

Reputation: 3755

We can bin the data with the cut() function as follows:

mybin <- cut(df$x,20,include.lowest=TRUE,right = FALSE)
df$Bins <- mybin
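As a quick check of what cut() returns here, the sketch below rebuilds the same simulated data used in this answer and inspects the resulting factor. The names bins and counts are illustrative only.

```r
# Same simulated data as in the Data section of this answer
set.seed(42)
x <- runif(1000)
y <- x^2 + x + 4 * rnorm(1000)
df <- data.frame(x = x, y = y)

# 20 equal-width, left-closed intervals over the range of x
bins <- cut(df$x, 20, include.lowest = TRUE, right = FALSE)

# Number of observations that fall into each interval
counts <- table(bins)

head(levels(bins), 3)  # interval labels such as "[a,b)"
print(counts)
```

Each level of the factor is one interval on x, so grouping by it collects the points that share a bin.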

Then, to calculate the mean of the binned data:

library(tidyverse)

out <- df %>%
  group_by(Bins) %>%
  summarise(x = mean(x), y = mean(y)) %>%
  as.data.frame()

To compare our results with the stat_summary_bin() function of ggplot2, we can plot them together:

ggplot(df, aes(x = x, y = y)) +
  geom_point(alpha = 0.4) +
  stat_summary_bin(fun = 'mean', bins = 20,
                   color = 'orange', size = 2, geom = 'point') +
  geom_point(data = out, color = "green")

# Green dots are the points we calculated. They line up with the orange points from stat_summary_bin().


Now, to calculate the variance, we can follow the same process with the var() function:

df %>% group_by(Bins) %>% summarise(Varx = var(x), Vary = var(y)) %>% as.data.frame()

This gives the variance within each bin. Note that, since the data are binned along the x axis, the variance of x within each bin will be almost zero. The important quantity here is therefore the variance of y.

  • The within-bin variances give a rough indication of whether the data are heteroscedastic.

  • The path of the binned means also shows the pattern in the data. Your data have a positive trend (the binned means do not need to follow a perfectly smooth line), but the trend is weak because, as you suggested, some of the bin means deviate from it.
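The question also asks about the variance of the coloured points themselves, which is a different quantity from the within-bin variances. Below is a minimal sketch, assuming that "first-order effect" in the question means the variance of the binned means of y relative to the total variance of y. The names bin_means, var_of_means, and first_order are illustrative, and base R is used so the block runs on its own.

```r
# Same simulated data and binning as in this answer
set.seed(42)
x <- runif(1000)
y <- x^2 + x + 4 * rnorm(1000)
df <- data.frame(x = x, y = y)
df$Bins <- cut(df$x, 20, include.lowest = TRUE, right = FALSE)

# Mean of y in each bin: these are the coloured points
bin_means <- tapply(df$y, df$Bins, mean)

# Unweighted variance of the binned means (empty bins, if any, are skipped)
var_of_means <- var(bin_means, na.rm = TRUE)

# Assumed definition of the first-order effect:
# variance of the binned means divided by the total variance of y
first_order <- var_of_means / var(df$y)

print(var_of_means)
print(first_order)
```

With noisy data and few points per bin, the sampling noise in each bin mean inflates this variance, so the ratio is best read as a rough estimate.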

Data:

set.seed(42)
x <- runif(1000)
y <- x^2 + x + 4 * rnorm(1000)
df <- data.frame(x=x, y=y)

Note: The data and some of the ggplot2 code were taken from the question the OP linked.

Upvotes: 1
