Daniel Harris

Reputation: 13

Order barplots in R based on fill value

This problem has been brought up many times on Stack Overflow, but I couldn't find a solution tailored to my particular problem.

I have a data frame which includes a column of species and a column of genome_names:

species                  genome_name
Acinetobacter baumannii  Acinetobacter baumanii BIDMC 56 
Acinetobacter baumannii  Acinetobacter baumannii 1032359
Klebsiella pneumoniae    Klebsiella pneumoniae CHS 30
etc...

Using this code, I created a barplot with one bar per species, whose height is the number of genome_name entries:

library(ggplot2)
ggplot(PATRIC_genomes_AMR_2_ris_subset,aes(x=species,fill=genome_name)) + 
  geom_bar(colour="black") + scale_colour_continuous(guide = FALSE) + 
  labs(title="Number of unique strains") +
  labs(x = "Species",y="#Strains") + theme(legend.position="none") + 
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) 

I would like to order this barplot by increasing value of y (the number of genome_name entries). I blindly attempted to do this by converting my data to a factor, to no avail:

Error in `[<-.data.frame`(`*tmp*`, del, value = NULL) : 
missing values are not allowed in subscripted assignments of data frames

Upvotes: 1

Views: 90

Answers (3)

alistaire

Reputation: 43364

To order the bars, set species to a factor with the levels sorted by occurrences.

Plotting is taking so long because you're actually drawing a bar for every pair of species and genome_name that occurs (12,339 of them, to be precise) and stacking those bars by species. If you just want black bars, take out the fill aesthetic; ggplot can then aggregate much more quickly, as it is only drawing one bar per species:

# download data
df <- gsheet::gsheet2tbl('https://docs.google.com/spreadsheets/d/16oHo85Pb8PVX2VqxlqEHizn10H3jVdjRC-kDrELcOfs/edit#gid=1638547987')

ggplot(df, aes(x = factor(species, names(sort(-table(species)))))) + 
    geom_bar(colour = "black") + 
    labs(title = "Number of unique strains") +
    labs(x = "Species", y = "#Strains") + 
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) 

plot with black bars
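
The ordering comes entirely from the factor(...) expression inside aes(). As a minimal sketch of what that expression does, here it is unpacked step by step on a made-up vector (the toy species values and the names counts, ordered_names and species_f are assumptions for illustration, not part of the real data):

```r
# Toy stand-in for df$species (hypothetical values)
species <- c("Klebsiella pneumoniae", "Acinetobacter baumannii",
             "Acinetobacter baumannii", "Escherichia coli",
             "Acinetobacter baumannii", "Klebsiella pneumoniae")

# Occurrences per species
counts <- table(species)

# Negating the counts makes sort() put the most frequent species first
ordered_names <- names(sort(-counts))

# Levels follow that order; the values themselves are unchanged
species_f <- factor(species, levels = ordered_names)

print(levels(species_f))
```

ggplot2 draws a discrete x axis in level order, so whatever order the levels have here is the order of the bars.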

If you plot with a fill aesthetic using the same approach, you'll get black bars anyway: the colour aesthetic set in geom_bar puts a black stroke around each stacked bar, and because the bars are so small, the stroke covers up the fill color. One way to avoid the issue is to simply take out colour = "black":

ggplot(df, aes(x = factor(species, names(sort(-table(species)))), fill = genome_name)) + 
    geom_bar() + 
    labs(title = "Number of unique strains") +
    labs(x = "Species", y = "#Strains") + 
    theme(legend.position = "none",
          axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) 

plot with colored bars

If you really want a black stroke on each stacked bar, you'll need to set size to something small enough that the fill is not covered by the stroke (in ggplot2 3.4.0 and later, this argument is called linewidth):

ggplot(df, aes(x = factor(species, names(sort(-table(species)))), fill = genome_name)) + 
    geom_bar(colour = "black", size = 0.01) + 
    labs(title = "Number of unique strains") +
    labs(x = "Species", y = "#Strains") + 
    theme(legend.position = "none",
          axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) 

plot with colored bars with black stroke

Upvotes: 0

Tyler Moss

Reputation: 11

Reorder the factor levels before plotting:

df$species <- reorder(df$species, df$genome_name)

Edit: My bad for not looking at the data more closely. This plots the number of unique strains sorted by number.

library(dplyr)
library(ggplot2)

df %>%
  group_by(species) %>%
  summarise(unique_strains = length(unique(genome_name))) %>%
  mutate(species = reorder(species, unique_strains)) %>%
  ggplot(aes(species, unique_strains)) + geom_bar(stat = "identity") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) + 
  xlab(NULL) +
  scale_y_log10()
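
The key step in the pipeline above is that reorder() sorts the levels by a summary (the mean by default) of its second argument, so it needs a numeric score such as the per-species count. A rough base R sketch of the same idea, without dplyr, on hypothetical toy data (df_toy, summary_toy and the values in them are made up for illustration):

```r
# Hypothetical toy data with the same two columns as the question
df_toy <- data.frame(
  species = c("A", "A", "A", "A", "B", "C", "C"),
  genome_name = c("a1", "a2", "a3", "a3", "b1", "c1", "c2"),
  stringsAsFactors = FALSE
)

# Number of distinct genome_name values per species
unique_strains <- tapply(df_toy$genome_name, df_toy$species,
                         function(x) length(unique(x)))

summary_toy <- data.frame(species = names(unique_strains),
                          unique_strains = as.vector(unique_strains),
                          stringsAsFactors = FALSE)

# reorder() arranges the levels by ascending unique_strains
summary_toy$species <- reorder(summary_toy$species, summary_toy$unique_strains)

print(levels(summary_toy$species))
```

Because the levels come out in ascending order of the count, plotting summary_toy with species on the x axis would give bars in increasing height, which is what the question asked for.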

Upvotes: 1

Hack-R

Reputation: 23241

library(ggplot2)
PATRIC_genomes_AMR_2_ris_subset <- read.csv("genomes_subset.csv", header = TRUE)
PATRIC_genomes_AMR_2_ris_subset <- dplyr::sample_n(PATRIC_genomes_AMR_2_ris_subset, 300)

PATRIC_genomes_AMR_2_ris_subset <- PATRIC_genomes_AMR_2_ris_subset[order(PATRIC_genomes_AMR_2_ris_subset$species),]


# Order by genome_name
PATRIC_genomes_AMR_2_ris_subset <- within(PATRIC_genomes_AMR_2_ris_subset, 
                   Position     <- factor(genome_name, 
                                      levels=names(sort(table(genome_name), 
                                                        decreasing=TRUE))))


ggplot(PATRIC_genomes_AMR_2_ris_subset,aes(x=species,fill=genome_name)) + 
  geom_bar(colour="black") + scale_colour_continuous(guide = FALSE) + 
  labs(title="Number of unique strains") +
  labs(x = "Species",y="#Strains") + theme(legend.position="none") + 
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) 

# Order by species
PATRIC_genomes_AMR_2_ris_subset <- within(PATRIC_genomes_AMR_2_ris_subset, 
                                          species <- factor(species, 
                                                         levels=names(sort(table(species), 
                                                         decreasing=TRUE))))

ggplot(PATRIC_genomes_AMR_2_ris_subset,aes(x=species,fill=genome_name)) + 
  geom_bar(colour="black") + scale_colour_continuous(guide = FALSE) + 
  labs(title="Number of unique strains") +
  labs(x = "Species",y="#Strains") + theme(legend.position="none") + 
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) 


This is pretty much the same as this, but your question mentions ordering by the fill value, genome_name, which is a little different, and we also got to see how the ordering affects the run time, so it's not a duplicate.
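
The question asked for increasing order, whereas the calls above sort with decreasing=TRUE. A minimal sketch of the same within()/factor() pattern with the direction flipped, on a hypothetical toy data frame (toy and its values are made up for illustration):

```r
# Hypothetical toy data; only the species column matters here
toy <- data.frame(
  species = c("K", "A", "A", "E", "A", "K"),
  stringsAsFactors = FALSE
)

# sort() is ascending by default, so the rarest species comes first
toy <- within(toy,
              species <- factor(species,
                                levels = names(sort(table(species)))))

print(levels(toy$species))
```

Applied to the real data frame, dropping decreasing=TRUE in the same way should put the shortest bar on the left.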

Upvotes: 1
