Reputation: 13
This problem has been brought up a million times on Stack Overflow, but I couldn't find a solution that fits my particular problem.
I have a data frame which includes a column of species and a column of genome_names:
species genome_name
Acinetobacter baumannii Acinetobacter baumanii BIDMC 56
Acinetobacter baumannii Acinetobacter baumannii 1032359
Klebsiella pneumoniae Klebsiella pneumoniae CHS 30
etc...
Using this code, I created a bar plot with species on the x-axis, where the height of each bar is the number of genome_name entries for that species:
library(ggplot2)
ggplot(PATRIC_genomes_AMR_2_ris_subset,aes(x=species,fill=genome_name)) +
geom_bar(colour="black") + scale_colour_continuous(guide = FALSE) +
labs(title="Number of unique strains") +
labs(x = "Species",y="#Strains") + theme(legend.position="none") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
I would like to order the bars by increasing value of y (the number of genome_name entries). I blindly tried converting my data to a factor, to no avail:
Error in `[<-.data.frame`(`*tmp*`, del, value = NULL) :
missing values are not allowed in subscripted assignments of data frames
Upvotes: 1
Views: 90
Reputation: 43364
To order the bars, set species to a factor with the levels sorted by number of occurrences.
Plotting takes so long because you're actually drawing a bar for every pair of species and genome_name that occurs (12,339 of them, to be precise) and stacking those bars by species. If you just want black bars, take out the fill aesthetic; ggplot can then aggregate much more quickly, as it only draws one bar per species:
# download data
df <- gsheet::gsheet2tbl('https://docs.google.com/spreadsheets/d/16oHo85Pb8PVX2VqxlqEHizn10H3jVdjRC-kDrELcOfs/edit#gid=1638547987')
ggplot(df, aes(x = factor(species, names(sort(-table(species)))))) +
geom_bar(colour = "black") +
labs(title = "Number of unique strains") +
labs(x = "Species", y = "#Strains") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
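Note that sorting on -table(species) puts the most frequent species first. Since the question asks for increasing order, here is a minimal sketch of the same factor-level idea sorted the other way. The toy data frame and its values are made up for illustration and are not the real data:

```r
# Toy data standing in for the real table (hypothetical values).
toy <- data.frame(
  species = c("A. baumannii", "A. baumannii", "A. baumannii",
              "K. pneumoniae", "K. pneumoniae", "E. coli"),
  genome_name = paste("strain", 1:6),
  stringsAsFactors = FALSE
)

# table() counts rows per species; sort() puts the smallest count first.
counts <- sort(table(toy$species))

# Use the sorted names as factor levels; ggplot places bars in level order.
toy$species <- factor(toy$species, levels = names(counts))
levels(toy$species)

# Build the plot only if ggplot2 is installed; print(p) would draw it.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(toy, aes(x = species)) + geom_bar(colour = "black")
}
```

Dropping the minus sign (or using decreasing = FALSE) is the only change needed to flip the order in the original call.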
If you plot with a fill aesthetic using the same approach, you'll only get black bars anyway: the colour aesthetic set in geom_bar puts a black stroke around each stacked bar, and because the bars are so small, the stroke covers up the fill color. One way to avoid the issue is to simply take out colour = "black":
ggplot(df, aes(x = factor(species, names(sort(-table(species)))), fill = genome_name)) +
geom_bar() +
labs(title = "Number of unique strains") +
labs(x = "Species", y = "#Strains") +
theme(legend.position = "none",
axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
If you really want a black stroke on each stacked bar, you'll need to set size (named linewidth in ggplot2 3.4.0 and later) to something small enough that the fill is not covered by the stroke:
ggplot(df, aes(x = factor(species, names(sort(-table(species)))), fill = genome_name)) +
geom_bar(colour = "black", size = 0.01) +
labs(title = "Number of unique strains") +
labs(x = "Species", y = "#Strains") +
theme(legend.position = "none",
axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
Upvotes: 0
Reputation: 11
Reorder the factor levels before plotting:
df$species <- reorder(df$species, df$genome_name)
Edit: My bad for not looking at the data more closely. The code below plots the number of unique strains per species, with the bars sorted by that count.
library(dplyr)
library(ggplot2)
df %>%
group_by(species) %>%
summarise(unique_strains = length(unique(genome_name))) %>%
mutate(species = reorder(species, unique_strains)) %>%
ggplot(aes(species, unique_strains)) + geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
xlab(NULL) +
scale_y_log10()
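For comparison, a base R sketch of the same ordering, assuming the goal is to rank species by their number of distinct genome_name values. reorder() accepts a summary function through its FUN argument, so it can count unique strings directly rather than averaging them. The toy data and the helper n_unique are hypothetical:

```r
# Toy data (made up): K. pneumoniae has the most rows but only 2 distinct
# strains, so row-count order and unique-count order differ.
toy <- data.frame(
  species = c(rep("A. baumannii", 4), rep("K. pneumoniae", 5), "E. coli"),
  genome_name = c("s1", "s2", "s2", "s3",
                  "s4", "s4", "s4", "s5", "s5",
                  "s6"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of distinct values in a vector.
n_unique <- function(x) length(unique(x))

# reorder() applies FUN to genome_name within each species and sorts the
# levels by the result, smallest first.
toy$species <- reorder(factor(toy$species), toy$genome_name, FUN = n_unique)
levels(toy$species)
```

The default FUN is mean, which is not meaningful for character data such as genome_name, so a counting function has to be supplied explicitly.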
Upvotes: 1
Reputation: 23241
library(ggplot2)
PATRIC_genomes_AMR_2_ris_subset <- read.csv("genomes_subset.csv", header = TRUE)
PATRIC_genomes_AMR_2_ris_subset <- dplyr::sample_n(PATRIC_genomes_AMR_2_ris_subset, 300)
PATRIC_genomes_AMR_2_ris_subset <- PATRIC_genomes_AMR_2_ris_subset[order(PATRIC_genomes_AMR_2_ris_subset$species),]
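# Order genome_name levels by frequency (stored in a new Position column;
# the plot below still maps fill to genome_name, not Position)
PATRIC_genomes_AMR_2_ris_subset <- within(PATRIC_genomes_AMR_2_ris_subset,
  Position <- factor(genome_name,
                     levels = names(sort(table(genome_name),
                                         decreasing = TRUE))))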
ggplot(PATRIC_genomes_AMR_2_ris_subset,aes(x=species,fill=genome_name)) +
geom_bar(colour="black") + scale_colour_continuous(guide = FALSE) +
labs(title="Number of unique strains") +
labs(x = "Species",y="#Strains") + theme(legend.position="none") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
# Order by species
PATRIC_genomes_AMR_2_ris_subset <- within(PATRIC_genomes_AMR_2_ris_subset,
  species <- factor(species,
                    levels = names(sort(table(species),
                                        decreasing = TRUE))))
ggplot(PATRIC_genomes_AMR_2_ris_subset,aes(x=species,fill=genome_name)) +
geom_bar(colour="black") + scale_colour_continuous(guide = FALSE) +
labs(title="Number of unique strains") +
labs(x = "Species",y="#Strains") + theme(legend.position="none") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
This is pretty much the same as an earlier question, but you mentioned ordering by the fill value, genome_name, which is a little different. We also got to see how the ordering affects the run time, so it's not a duplicate.
Upvotes: 1