Matt

Reputation: 137

Multilevel pie charts with consistent coloring

I'm trying to create a multilevel pie chart for several files that are in the following format:

117.txt

compartment percent sequence
dna         90      AAGTGT
dna          3      AAGTGG
dna          0      AAAAAA
...
rna         75      AAGTGT
rna         10      AAAAAA
rna         10      AAGTGG
...
...
plasma      75      AAGTGT
plasma      10      AAGTGG
plasma       0      AAAAAA

I'm trying to draw concentric pie charts with ggplot2, with a unique color for every distinct sequence, from files like the simplified one above (which I can read in as a data frame df). Each compartment lists the same 2951 unique sequences, each with its percent, or "0" if the sequence is absent from that compartment. Every file therefore has 2951 sequences * 3 compartments = 8853 lines.

So far my code works well for individual files, but the order of sequences doesn't necessarily follow the order of my custom palette, and the colors aren't consistent across files (i.e., the "AAGTGT" sequence should always be the same color regardless of the input file). @Prem helped me a bit with a similar question, but I can't figure out what's going on here. The code is below:

library(ggplot2)
library(randomcoloR)

pal<-c(randomColor(count=2951))
# Each "+" must end a line; a line that starts with "+" is parsed as a new statement
ggplot(df, aes(x = compartment, y = percent, fill = sequence)) +
    labs(title = "117") +
    geom_bar(stat = "identity") +
    scale_fill_manual(values = pal) +
    scale_x_discrete(limits = c("dna", "rna", "plasma"),
                     labels = c("plasma" = "Plasma\nvRNA", "rna" = "RNA", "dna" = "DNA")) +
    theme_bw() +
    theme(legend.position = "none") +
    coord_polar(theta = "y") +
    theme(axis.line = element_blank(), panel.grid.major.x = element_blank(), panel.grid.major.y = element_blank(),
          panel.grid.minor = element_blank(), panel.border = element_blank(), panel.background = element_blank()) +
    theme(axis.text = element_blank(), axis.title = element_blank(), axis.ticks = element_blank()) +
    theme(plot.title = element_text(colour = "black", face = "bold", size = 24, hjust = 0.5))

When I run it on my larger data file with my 2951 sequences for each of the three compartments, not only do my palette colors not necessarily follow the order of the sequences, but they are not consistent across graphs (see attached figure for data sets #117 and #129 whose majority sequences should be the same color).

Any help would be extremely appreciated as I think this representation is truly helpful for the message of my data. Thanks everyone!

Upvotes: 1

Views: 776

Answers (1)

eipi10

Reputation: 93761

I can't be certain without a reproducible example to work from, but I think a named vector of fill colors will give consistent colors. For example:

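library(randomcoloR)

set.seed(2) # For reproducibility of random color vector
pal <- randomColor(count = 2951)
# Name each color after a sequence so colors are matched by name, not by position
pal <- setNames(pal, unique(df$sequence))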

Now run your plot code as usual. By using a named vector of colors where the names are the levels of sequence, you should always get the same color assigned to the same sequence.
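As a minimal sketch of why the names matter, here is a toy example in base R. The three sequences, the hex colors, and the data frames df_a and df_b are made up for illustration. Looking a color up by name returns the same color for a sequence wherever it appears in the data, and scale_fill_manual(values = pal) matches a named vector to the fill levels by name in the same way.

```r
# Toy palette: three hypothetical sequences with hand-picked colors.
pal <- c(AAGTGT = "#1b9e77", AAGTGG = "#d95f02", AAAAAA = "#7570b3")

# Two toy data frames that list the sequences in different orders.
# stringsAsFactors = FALSE keeps sequence as character, so pal[...] indexes
# by name rather than by integer factor code.
df_a <- data.frame(sequence = c("AAGTGT", "AAGTGG", "AAAAAA"),
                   percent  = c(90, 3, 0),
                   stringsAsFactors = FALSE)
df_b <- data.frame(sequence = c("AAAAAA", "AAGTGT"),
                   percent  = c(10, 75),
                   stringsAsFactors = FALSE)

# Look colors up by sequence name.
colors_a <- pal[df_a$sequence]
colors_b <- pal[df_b$sequence]

# "AAGTGT" is first in df_a and second in df_b, but gets the same color.
same_color <- unname(colors_a["AAGTGT"]) == unname(colors_b["AAGTGT"])
```

An unnamed vector, by contrast, is assigned by position, so the color a sequence receives depends on the order of the levels in each data frame.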

(I'm also assuming in the code above that there are 2,951 unique levels of sequence. A better approach would be pal <- randomColor(count=length(unique(df$sequence))) so that you get the number of colors from the data, rather than hard-coding it.)

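The above will work for a single data frame, or for a group of data frames where every data frame includes all possible sequences that can appear in any data frame, as long as you build pal once and reuse it for every plot rather than rebuilding it from each data frame.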

If you have multiple data frames that can contain different sequences, then create the named color vector based on the collection of unique sequences across all the data frames. Ideally, your data frames would be in a list (let's assume it's called df.list) where each element is a data frame. Then you could do:

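# lapply() always returns a list; sapply() would return a matrix when every
# data frame has the same number of rows, and unique() would then compare rows
sequences <- unique(unlist(lapply(df.list, function(d) as.character(d$sequence))))
set.seed(2)
pal <- randomColor(count = length(sequences))
pal <- setNames(pal, sequences)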

If your data frames are loaded as separate objects (i.e., not in a list) you could do:

sequences <- unique(unlist(lapply(list(df1, df2, df3), function(d) as.character(d$sequence))))

where df1, df2, and df3 are your separate data frames.
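Putting the pieces together, a rough sketch of the whole workflow might look like the following. The two toy data frames stand in for real input files, plot_one is a hypothetical helper (not part of any package), and base R's rainbow() is swapped in for randomColor() only so the sketch runs without extra packages. The plotting step is skipped if ggplot2 is not installed.

```r
# Toy stand-ins for two input files (hypothetical data, not the real files).
df.list <- list(
  "117" = data.frame(compartment = c("dna", "dna", "rna", "rna"),
                     percent     = c(90, 10, 75, 25),
                     sequence    = c("AAGTGT", "AAGTGG", "AAGTGT", "AAAAAA"),
                     stringsAsFactors = FALSE),
  "129" = data.frame(compartment = c("dna", "dna", "rna", "rna"),
                     percent     = c(60, 40, 80, 20),
                     sequence    = c("AAAAAA", "AAGTGT", "AAGTGT", "CCCCCC"),
                     stringsAsFactors = FALSE)
)

# One palette for all files: collect every sequence, then name the colors.
sequences <- unique(unlist(lapply(df.list, function(d) as.character(d$sequence))))
pal <- setNames(rainbow(length(sequences)), sequences)

# Hypothetical helper: one concentric chart drawn with the shared palette.
plot_one <- function(d, title, pal) {
  ggplot2::ggplot(d, ggplot2::aes(x = compartment, y = percent, fill = sequence)) +
    ggplot2::geom_bar(stat = "identity") +
    ggplot2::scale_fill_manual(values = pal) +  # named vector: matched by name
    ggplot2::scale_x_discrete(limits = c("dna", "rna")) +
    ggplot2::coord_polar(theta = "y") +
    ggplot2::labs(title = title) +
    ggplot2::theme_void() +
    ggplot2::theme(legend.position = "none")
}

# Build one plot per data frame, all sharing the same palette.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  plots <- Map(function(d, id) plot_one(d, id, pal), df.list, names(df.list))
}
```

Because pal is built once from the union of all sequences, "AAGTGT" is filled with the same color in both charts even though "129" contains a sequence ("CCCCCC") that "117" lacks.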

Upvotes: 1
