Marco Meyer

Reputation: 373

R: Why does my heatmap look different depending on whether I sort my data first?

I am making a heatmap in R that shows how one variable (Corona Misinformation Score) depends on two others (Indifference Score and Rigidity Score). I do not understand why ordering my data by the Corona Misinformation Score changes how the heatmap looks.

Here is the code I use to generate the graph:

dset %>%
  arrange(Mean_Corona) %>%
  ggplot(aes(x=Mean_Rigidity, y=Mean_Indifference, fill = Mean_Corona)) +
  geom_tile(alpha=0.8) +
  scale_fill_distiller(palette = "RdYlGn") +
  ylab("Indifference Score") +
  xlab("Rigidity Score") +
  labs(fill="Corona Misinformation Score") +
  theme(
    legend.position="bottom", 
    panel.background = element_rect(fill = "white"),
    panel.grid.major = element_line(colour = "grey70", size = 0.2),
    panel.grid.minor = element_blank())

This is what the graph looks like:

[Figure: heatmap with arrange]

If I run the same code but remove the second line (arrange(Mean_Corona) %>%), the heatmap instead looks like this:

[Figure: heatmap without arrange]

If I sort the data by the same variable in descending order, the heatmap looks different again. What I don't understand is why the order of the rows in the dataset should make any difference to how the graph looks. Shouldn't the shading of each tile just be determined by the average Corona Misinformation Score of the people with that combination of Indifference and Rigidity scores? I am stuck because I am not sure which version displays my data more accurately.

Upvotes: 0

Views: 470

Answers (2)

Allan Cameron

Reputation: 174476

You will notice that the plots have all the tiles in the same positions, but that some tiles have different colours. You are quite right that the ordering of Mean_Corona shouldn't make a difference, but that is true only if the position of each tile is unique. geom_tile draws one tile per row, in row order. If you have multiple rows for the same tile position and you sort by Mean_Corona, the lower-value tiles are drawn first and the higher-value tiles are drawn on top of them. If you reverse that ordering, the higher-value tiles are obscured by the lower-value tiles.

We can see this more clearly if we create a small dummy data set with 8 tiles but only 4 unique tile positions:

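library(dplyr)    # %>%, arrange(), group_by(), summarise()
library(ggplot2)

# 8 rows, but only 4 distinct (Mean_Rigidity, Mean_Indifference) positions
dset <- data.frame(Mean_Corona = 1:8,
                   Mean_Indifference = rep(c(0.5, 1.5), 4),
                   Mean_Rigidity = rep(c(0.5, 1.5), each = 4))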
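A quick way to check whether a data set has this problem is to count the rows at each tile position. The following is only a sketch on the dummy data (repeated here so the snippet runs on its own); the names overlaps and n_tiles are made up for illustration. Any position with more than one row has tiles stacked on top of each other.

```r
library(dplyr)

# Same dummy data as above, repeated so this snippet is self-contained
dset <- data.frame(Mean_Corona = 1:8,
                   Mean_Indifference = rep(c(0.5, 1.5), 4),
                   Mean_Rigidity = rep(c(0.5, 1.5), each = 4))

# Count rows per tile position and keep positions with more than one row
overlaps <- dset %>%
  count(Mean_Rigidity, Mean_Indifference, name = "n_tiles") %>%
  filter(n_tiles > 1)

print(overlaps)
```

If the same check on your real data returns no rows, every tile position is unique and row order should not affect the plot.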

So let's plot this with the original data frame, which happens to be sorted by Mean_Corona already:

dset %>%
  ggplot(aes(x=Mean_Rigidity, y=Mean_Indifference, fill = Mean_Corona)) +
  geom_tile(alpha=0.8) +
  scale_fill_distiller(palette = "RdYlGn") +
  ylab("Indifference Score") +
  xlab("Rigidity Score") +
  labs(fill="Corona Misinformation Score") +
  theme(
    legend.position="bottom", 
    panel.background = element_rect(fill = "white"),
    panel.grid.major = element_line(colour = "grey70", size = 0.2),
    panel.grid.minor = element_blank())


Now we plot with the values in descending order. Here we see that the lower values have been plotted over the higher values:


dset %>%
  arrange(desc(Mean_Corona)) %>%
  ggplot(aes(x=Mean_Rigidity, y=Mean_Indifference, fill = Mean_Corona)) +
  geom_tile(alpha=0.8) +
  scale_fill_distiller(palette = "RdYlGn") +
  ylab("Indifference Score") +
  xlab("Rigidity Score") +
  labs(fill="Corona Misinformation Score") +
  theme(
    legend.position="bottom", 
    panel.background = element_rect(fill = "white"),
    panel.grid.major = element_line(colour = "grey70", size = 0.2),
    panel.grid.minor = element_blank())

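To make the effect of row order concrete without looking at the plots, one can list which row is drawn last (and so ends up on top) at each position. This is a rough sketch on the dummy data; top_tile is a hypothetical helper, not part of ggplot2, and it relies on geom_tile drawing rows in data order. Because of alpha=0.8 the visible colour is a blend dominated by the top tile rather than exactly its colour.

```r
library(dplyr)

# Same dummy data as above, repeated so this snippet is self-contained
dset <- data.frame(Mean_Corona = 1:8,
                   Mean_Indifference = rep(c(0.5, 1.5), 4),
                   Mean_Rigidity = rep(c(0.5, 1.5), each = 4))

# Hypothetical helper: the last row at each position is the one drawn on top
top_tile <- function(df) {
  df %>%
    group_by(Mean_Rigidity, Mean_Indifference) %>%
    slice_tail(n = 1) %>%
    ungroup()
}

top_ascending  <- top_tile(arrange(dset, Mean_Corona))
top_descending <- top_tile(arrange(dset, desc(Mean_Corona)))

top_ascending$Mean_Corona   # higher value of each pair ends up on top
top_descending$Mean_Corona  # lower value of each pair ends up on top
```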

One possible solution here is to group by both the indifference and rigidity scores, then take the average of Mean_Corona over the rows at each position. That ensures you have a single tile at each location, so row order no longer matters and the fill better reflects the relationship between the variables.

dset %>%
  group_by(Mean_Rigidity, Mean_Indifference) %>%
  summarise(Mean_Corona = mean(Mean_Corona), .groups = "drop") %>%
  ggplot(aes(x=Mean_Rigidity, y=Mean_Indifference, fill = Mean_Corona)) +
  geom_tile(alpha=0.8) +
  scale_fill_distiller(palette = "RdYlGn") +
  ylab("Indifference Score") +
  xlab("Rigidity Score") +
  labs(fill="Corona Misinformation Score") +
  theme(
    legend.position="bottom", 
    panel.background = element_rect(fill = "white"),
    panel.grid.major = element_line(colour = "grey70", size = 0.2),
    panel.grid.minor = element_blank())

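As a sanity check, the aggregated data should come out the same whichever way the rows are sorted beforehand. The sketch below verifies this on the dummy data; aggregate_tiles is a hypothetical helper name introduced only for this example.

```r
library(dplyr)

# Same dummy data as above, repeated so this snippet is self-contained
dset <- data.frame(Mean_Corona = 1:8,
                   Mean_Indifference = rep(c(0.5, 1.5), 4),
                   Mean_Rigidity = rep(c(0.5, 1.5), each = 4))

# Hypothetical helper: one averaged row per tile position
aggregate_tiles <- function(df) {
  df %>%
    group_by(Mean_Rigidity, Mean_Indifference) %>%
    summarise(Mean_Corona = mean(Mean_Corona), .groups = "drop")
}

from_ascending  <- aggregate_tiles(arrange(dset, Mean_Corona))
from_descending <- aggregate_tiles(arrange(dset, desc(Mean_Corona)))

# TRUE when both orderings give the same aggregated tiles
same_result <- isTRUE(all.equal(from_ascending, from_descending))
print(same_result)
```

Whether the mean is the right summary depends on the data; median() or another statistic could be swapped in the same way.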

Upvotes: 1

Roel Peters

Reputation: 11

You should remove the alpha. With alpha=0.8 the tiles are semi-transparent, so tiles drawn at the same position show through one another, and the row order determines which tile is drawn on top.

Best regards Roel

Upvotes: 1
