Cooper James

Reputation: 35

How to add categorical variables to a percentage stacked bar chart?

First time posting here, so let me know if I left out any details that are normally included. I am using ggplot2 and ggdendro to make a percentage stacked bar chart with a hierarchical clustering tree, where each leaf of the tree is associated with one of my bars.

As you can see, I have more or less figured this out (note that this is just a subset of my data). I now want to associate a categorical variable with each of my bars, with each level represented by a color (in my case this is HIV+ or HIV-, and each bar represents the % of cells in a given category). Additionally, I want to figure out how to add the sample name to each dendrogram node, but this issue is less pressing. Below is the code block I am using.

library(ggplot2)
library(ggdendro)

# Load in phenograph data
TotalPercentage <- read.csv("~/TotalPercentage.csv", header=TRUE)

# Generate the tree from the cluster columns only (column 1 is the sample ID)
tree <- hclust(dist(TotalPercentage[-1]))
tree <- dendro_data(tree)

data <- cbind(TotalPercentage, x = match(rownames(TotalPercentage), tree$labels$label))



# Plot the stacked bars. In "data = tidyr::pivot_longer(data, c(2..." include
## all cluster columns but exclude column 1, as this value is our sample ID

scale <- .5
p <- ggplot() +
  geom_col(
    data = tidyr::pivot_longer(data, c(2, 3 , 4, 5, 6, 7, 8)),
    aes(x = x,
        y = value, fill = factor(name)),
  ) +
  labs(title="Unsupervised Clustering of Phenograph Output",
       x = "Participant Sample", y = "Cluster Representation (%)"
  ) +
  geom_segment(
    data = tree$segments,
    aes(x = x, y = -y * scale, xend = xend, yend = -yend * scale)
  )

p

Here is a sample dataset with fewer rows for simplicity:

data.frame(
  `Participant ID` = c("123", "456", "789"),
  `1` = c(.1933, .1721, 34.26),
  `2` = c(20.95, 4.97, 2.212),
  `3` = c(11.31, 35.34, .027),
  `4` = c(35.55, 15.03, 0),
  `5` = c(.26, .87, 7.58),
  `6` = c(12.85, 33.44, .033),
  `7` = c(2.04, 3.77, 4.32)
)

Here participants 1 and 3 (IDs 123 and 789) are HIV+, and participant 2 (ID 456) is HIV-.
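
One way to carry that status alongside the percentages is an extra column in the same data frame. This is only a sketch: the column name `HIV` and the helper `cluster_cols` are assumptions made for illustration, not something from the original data file.

```r
# Sample data from the question; check.names = FALSE keeps the names as typed.
TotalPercentage <- data.frame(
  `Participant ID` = c("123", "456", "789"),
  `1` = c(.1933, .1721, 34.26),
  `2` = c(20.95, 4.97, 2.212),
  `3` = c(11.31, 35.34, .027),
  `4` = c(35.55, 15.03, 0),
  `5` = c(.26, .87, 7.58),
  `6` = c(12.85, 33.44, .033),
  `7` = c(2.04, 3.77, 4.32),
  check.names = FALSE
)

# `HIV` is a hypothetical column name: one value per row, in row order.
TotalPercentage$HIV <- factor(c("HIV+", "HIV-", "HIV+"),
                              levels = c("HIV-", "HIV+"))

# Name the numeric cluster columns once, so the ID and status columns
# stay out of clustering and normalization.
cluster_cols <- as.character(1:7)
str(TotalPercentage[cluster_cols])
```

Selecting `TotalPercentage[cluster_cols]` for `dist()` then keeps both the participant ID and the status out of the distance calculation.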

And finally here is an example of what I am ultimately trying to produce

(https://i.sstatic.net/uAWxR.png)

I've looked all over to see how to do this but I'm new to R so I'm kind of free floating and don't know what to do next. Thanks in advance for any help.

Upvotes: 2

Views: 242

Answers (3)

Yun

Reputation: 305

Another option is the ggalign package:

# randomly generated phenograph data
set.seed(1)
TotalPercentage <- data.frame(
    `Participant ID` = c("123", "456", "789"),
    `1` = 125 * runif(72),
    `2` = 75 * runif(72),
    `3` = 175 * runif(72),
    `4` = 10 * runif(72),
    `5` = 100 * runif(72),
    `6` = 150 * runif(72),
    `7` = 200 * runif(72),
    check.names = FALSE
)
library(ggalign)
#> Loading required package: ggplot2
ggstack(TotalPercentage) +
    # add color bar plot
    # we transform the input data frame into a long format data frame
    ggalign(action = plot_action(
        data = function(x) {
            ans <- tidyr::pivot_longer(x,
                cols = as.character(1:7),
                names_to = "group"
            )
            dplyr::summarise(
                ans,
                value = sum(value), .by = c(`Participant ID`, group, .y)
            )
        }
    )) +
    geom_col(aes(value, .y, fill = group),
        orientation = "y",
        position = position_fill()
    ) +
    scale_fill_brewer(palette = "Dark2") +
    # add dendrogram
    align_dendro(data = ~ .x[-1L]) &
    scale_x_continuous(expand = expansion()) &
    theme(plot.margin = margin(l = 5, r = 10))

Created on 2024-10-21 with reprex v2.1.0
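
The `position_fill()` call above is what turns raw values into percentage bars, so no manual row normalization is needed. A minimal ggplot2-only sketch of that idea, using a made-up `counts` data frame (all names and numbers here are hypothetical):

```r
library(ggplot2)

# Hypothetical raw values for three samples and three clusters (long format).
counts <- data.frame(
  sample  = rep(c("123", "456", "789"), each = 3),
  cluster = rep(c("1", "2", "3"), times = 3),
  value   = c(10, 30, 60, 5, 5, 10, 40, 40, 120)
)

p <- ggplot(counts, aes(x = value, y = sample, fill = cluster)) +
  # position = "fill" rescales each bar so its segments sum to 1
  geom_col(position = "fill", orientation = "y") +
  scale_x_continuous(labels = scales::percent)

# The same proportions computed by hand, for comparison with the plot.
counts$prop <- counts$value / ave(counts$value, counts$sample, FUN = sum)
```

Printing `p` draws the chart; `counts$prop` holds the per-sample proportions that each bar segment should correspond to.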

Upvotes: 1

Cooper James

Reputation: 35

Thanks to help from @Sandipan Dey, @Tal Galili and @r2evans, I have a plot I'm happy with. I figured I would post my final plot and code block here so others can see it. My data is still pretty messy, but the code should be usable for smaller, more manageable datasets in any case.

Here is my code

# Load in relevant packages
library(ggplot2)
library(ggdendro)
library(scales)

# Load in phenograph data; read.csv() already returns a data frame
TotalPercentage <- read.csv("~/TotalPercentage.csv", header = TRUE)

# Cluster your data using hclust; the type of clustering can be specified if desired
## The only thing that needs to be changed here is [5:43]:
### just tell it which columns you want to consider for clustering
tree <- hclust(dist(TotalPercentage[5:43]))
tree <- dendro_data(tree)

# Same as above: change [,5:43] to the columns you want to consider
## scale defines the size of the tree relative to the bars; this can be adjusted per preference
data <- cbind(TotalPercentage, x = match(rownames(TotalPercentage), tree$labels$label))
data[,5:43] <- data[,5:43] / rowSums(data[,5:43]) # row-normalize
scale <- 3e-3

## ggplot lets us build the graph; look for # comments to see why certain commands are there
ggplot() +
  geom_col(
    # Be sure to change 5:43 as above so it fits your data
    # Here we make the data more readable to ggplot by making it "longer" and use that to draw a bar graph
    data = tidyr::pivot_longer(data, 5:43),
    aes(x = x,
        y = value, fill = factor(name))
  ) +
   # These are our labels; you can specify them. Note that the x and y axes are flipped because we use the coord_flip() function in the last line
   ## To be honest I don't love how this looks and will try to fix the presentation in the
   labs(title="Unsupervised Clustering of Phenograph Output",
       y ="Cluster Representation (%)", x = "Participant Sample"
  ) +
  # Here we attach the tree to the bar graph and specify where it goes using y = and yend = 
  ## I personally like how this looks but if you wanted your tree on the left that can be done
  geom_segment(
    data = tree$segments,
    aes(x = x, y = y * scale + 1, xend = xend, yend = yend * scale + 1)
  ) +
 
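  # Here we add the labels for each sample next to its respective bar and also do some formatting
  ## data already holds each sample's position in the tree (x), so the labels are taken from it
  ## directly; pairing label(tree) with data$Cluster would mix tree order with row order
  geom_text(data = data,
    aes(x = x, y = 0, label = Cluster),
    hjust = 1, size = 3
  ) +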
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), panel.background = element_blank()
  ) +
  scale_y_continuous(limits=c(-.5,1.5), labels = scales::percent, breaks = c(0, .50, 1.00)) +
  
  # I couldn't figure out a better way to do this but here I label each cluster 
  ## If you do the work to determine what is in each cluster you can replace for example 'Cluster 1' with 'CD4+ proliferating cells'
  scale_fill_discrete(labels=c('Cluster 1', 'Cluster 2', 'Cluster 3', 'Cluster 4',
                               'Cluster 5', 'Cluster 6', 'Cluster 7', 'Cluster 8',
                               'Cluster 9', 'Cluster 10', 'Cluster 11', 'Cluster 12',
                               'Cluster 13', 'Cluster 14', 'Cluster 15', 'Cluster 16',
                               'Cluster 17', 'Cluster 18', 'Cluster 19', 'Cluster 20',
                               'Cluster 21', 'Cluster 22', 'Cluster 23', 'Cluster 24',
                               'Cluster 25', 'Cluster 26', 'Cluster 27', 'Cluster 28',
                               'Cluster 29', 'Cluster 30', 'Cluster 31', 'Cluster 32',
                               'Cluster 33', 'Cluster 34', 'Cluster 35', 'Cluster 36',
                               'Cluster 37', 'Cluster 38', 'Cluster 39')
                      ) +
  guides(fill=guide_legend(title="Cluster")) +
  # some final formatting, I think the plot looks better left to right so I went with coord_flip which moves it on its side
  theme(
  axis.text.y = element_blank(),
  axis.ticks.y = element_blank()
  ) +
  coord_flip()

And here is the plot it gave me
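
The code comments above mention that the type of clustering can be specified. As a minimal sketch on made-up data (the matrix `m` and its row and column names are hypothetical), the distance measure goes into `dist()` and the linkage into `hclust()`:

```r
# Hypothetical proportions: rows are samples, columns are clusters.
set.seed(1)
m <- matrix(runif(6 * 4), nrow = 6,
            dimnames = list(paste0("S", 1:6), paste0("C", 1:4)))
m <- m / rowSums(m)                        # row-normalize, as in the code above

d <- dist(m, method = "euclidean")         # "manhattan" is another option
tree_complete <- hclust(d)                 # default linkage is "complete"
tree_average  <- hclust(d, method = "average")

# The leaf order is what decides the order of the bars in the plot.
tree_complete$labels[tree_complete$order]
tree_average$labels[tree_average$order]
```

Either tree can be passed to `dendro_data()` in place of the one used above; different linkages may reorder the bars.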

Upvotes: 1

Sandipan Dey

Reputation: 23129

Something like this, with randomly generated data:

# randomly generated phenograph data
set.seed(1)
TotalPercentage <- data.frame(
  `Participant ID` = c("123", "456", "789"),
  `1` = 125*runif(72),
  `2` = 75*runif(72),
  `3` = 175*runif(72),
  `4` = 10*runif(72),
  `5` = 100*runif(72),
  `6` = 150*runif(72),
  `7` = 200*runif(72)
)

Now cluster, normalize and plot:

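library(ggplot2)
library(ggdendro)

# cluster on the numeric columns only (column 1 is the participant ID)
tree <- hclust(dist(TotalPercentage[-1]))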
tree <- dendro_data(tree)
data <- cbind(TotalPercentage, x = match(rownames(TotalPercentage), tree$labels$label))
data[,2:8] <- data[,2:8] / rowSums(data[,2:8]) # row-normalize
scale <- 3e-4
ggplot() +
  geom_col(
    data = tidyr::pivot_longer(data, c(2, 3 , 4, 5, 6, 7, 8)),
    aes(x = x,
        y = value, fill = factor(name)),
  ) +
  labs(title="Unsupervised Clustering of Phenograph Output",
       x = "Participant Sample", y = "Cluster Representation (%)"
  ) +
  geom_segment(
    data = tree$segments,
    aes(x = x, y = -y * scale, xend = xend, yend = -yend * scale)
  ) +
  coord_flip()
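
To get the per-sample categorical variable from the question onto a plot like this, one option is a strip of coloured squares beside the bars. The sketch below rests on assumptions and is not a drop-in: the data, the sample IDs and the `HIV` column are made up. Status is mapped to `colour` through `geom_point()` because `fill` is already taken by the clusters. Sample names are drawn with `geom_text()` from the same data frame, so each label uses the same `x` as its bar.

```r
library(ggplot2)
library(ggdendro)

# Randomly generated stand-in data; `id` and `HIV` are hypothetical columns.
set.seed(1)
n <- 12
cluster_cols <- paste0("C", 1:7)
ids <- sprintf("P%02d", seq_len(n))
pct <- matrix(runif(n * 7), nrow = n, dimnames = list(ids, cluster_cols))
pct <- pct / rowSums(pct)                  # row-normalize to proportions

samples <- data.frame(
  id  = ids,
  HIV = sample(c("HIV-", "HIV+"), n, replace = TRUE),
  pct
)

# Cluster on the numeric matrix only, then look up each sample's leaf position.
tree <- dendro_data(hclust(dist(pct)))
samples$x <- tree$labels$x[match(samples$id, tree$labels$label)]

long  <- tidyr::pivot_longer(samples, cols = tidyr::all_of(cluster_cols))
scale <- 0.3 / max(tree$segments$y)        # tree spans 0.3 units beyond the bars

p <- ggplot() +
  geom_col(data = long, aes(x = x, y = value, fill = name)) +
  # Status strip: one square per bar, coloured by the categorical variable
  geom_point(data = samples, aes(x = x, y = -0.03, colour = HIV),
             shape = 15, size = 4) +
  # Sample names, positioned with the same x as the bars
  geom_text(data = samples, aes(x = x, y = -0.06, label = id),
            hjust = 1, size = 3) +
  # Dendrogram drawn past the 100% end of the bars
  geom_segment(data = tree$segments,
               aes(x = x, y = 1 + y * scale,
                   xend = xend, yend = 1 + yend * scale)) +
  scale_colour_manual(values = c("HIV-" = "grey70", "HIV+" = "firebrick")) +
  scale_y_continuous(labels = scales::percent, breaks = c(0, 0.5, 1)) +
  expand_limits(y = -0.2) +                # leave room for the labels
  labs(x = "Participant Sample", y = "Cluster Representation (%)",
       fill = "Cluster", colour = "HIV status") +
  coord_flip()
```

Printing `p` draws the chart. If a second `fill` legend is preferred over `colour`, that would need a separate fill scale, which this sketch deliberately avoids.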

Upvotes: 1
