Reputation: 93
I am trying to generate a histogram with some data, but I can't find a way to make ggplot2 work to achieve what I want.
For context, my data has the following columns:
| Name | Total Enrichment % (A+B+C+D) | %A | %B | %C | %D |
I want to generate a histogram showing the distribution of the Total Enrichment column, with each bar filled in 4 colors showing the shares of A, B, C, and D.
I've tried to convert the data into long format, but still, I cannot seem to get exactly what I want.
Any advice would be very helpful! Thank you very much!
Here is an example (it's not the original data, just a small part of it):
dat <- read.table(text = "Name Total A B C D
1 0.1396104 0.029220779 0.009740260 0.029220779 0.07142857
2 0.1250000 0.010869565 0.021739130 0.016304348 0.07608696
3 0.1337580 0.006369427 0.000000000 0.025477707 0.10191083
4 0.1239669 0.016528926 0.024793388 0.033057851 0.04958678
5 0.1242938 0.011299435 0.016949153 0.039548023 0.05649718
6 0.1311475 0.000000000 0.000000000 0.021857923 0.10928962
7 0.1376147 0.004587156 0.004587156 0.004587156 0.12385321
8 0.1574074 0.046296296 0.018518519 0.032407407 0.06018519
9 0.1269036 0.010152284 0.010152284 0.020304569 0.08629442", sep = "", header=T)
My goal is to create a histogram with the Total enrichment data, but with each column filled with the other contribution variables (A, B, C and D)
Thanks!
Edit
Thanks to StupidWolf's amazing help and comments, I got a little closer to what I want.
Here is what I've got so far (it's not perfect, but so far so good).
What I would like to do is put the y axis on a logarithmic scale, since I have a lot of data in the lower range and I'm also interested in the data with a higher enrichment. Also, does anyone know why the bars are not filled? Why are there these white spaces?
Again, thank you very much for your help and patience!
Upvotes: 0
Views: 1212
Reputation: 46898
I am making an educated guess about what you want to do. First, let's get some data:
set.seed(321)
library(ggplot2)
library(dplyr)
library(tidyr)  # pivot_longer() used below comes from tidyr
dat = data.frame(Name=1:500,matrix(runif(500*4),ncol=4))
colnames(dat)[-1] = LETTERS[1:4]
dat$Total = rowSums(dat[,-1])
If you want to calculate the contribution of A, B, C and D to each binned value of Total, first compute a histogram of Total and store its breaks, which are then used to assign each row to a bin:
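his_all = hist(dat$Total, breaks = 40)
# include.lowest = TRUE matches hist(), so the smallest Total cannot end up as NA
dat$bin = cut(dat$Total, breaks = his_all$breaks, labels = his_all$mids, include.lowest = TRUE)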
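As a quick sanity check on the binning step, the sketch below rebuilds the simulated data and compares the number of rows that cut() puts in each bin with the counts reported by hist(). The name bin_check and the use of plot = FALSE and include.lowest = TRUE are my additions for illustration, not part of the original answer.

```r
# Rebuild the simulated data from the answer so this block runs on its own.
set.seed(321)
dat <- data.frame(Name = 1:500, matrix(runif(500 * 4), ncol = 4))
colnames(dat)[-1] <- LETTERS[1:4]
dat$Total <- rowSums(dat[, -1])

# plot = FALSE computes the bins without drawing anything.
his_all <- hist(dat$Total, breaks = 40, plot = FALSE)

# include.lowest = TRUE mirrors hist()'s default handling of the smallest value.
dat$bin <- cut(dat$Total, breaks = his_all$breaks,
               labels = his_all$mids, include.lowest = TRUE)

# Rows per bin from cut() next to the counts reported by hist().
# bin_check is a hypothetical helper name used only in this sketch.
bin_check <- data.frame(mid = his_all$mids,
                        from_cut = as.vector(table(dat$bin)),
                        from_hist = his_all$counts)
print(bin_check)
```

If the two count columns agree, every row has been assigned to the same bin that hist() would draw it in.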
In the above, I used the midpoint of each histogram bin as its label, so that the bars can be drawn at the same x positions later. Because cut() returns a factor, the pipeline below includes a step that converts the label back to numeric. We then divide A to D by Total to get each one's contribution per row, pivot to long format, and plot:
dat %>%
  mutate_at(c("A","B","C","D"), ~.x/Total) %>%
  pivot_longer(A:D) %>%
  mutate(bin = as.numeric(as.character(bin))) %>%
  ggplot(aes(x = bin, y = value, fill = name)) +
  geom_col() +
  xlab("enrichment")
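The same steps can be tried on the nine sample rows from the question. This is only a sketch: the names dat_q, his_q, long_q and p_q, the choice of breaks = 5, and the use of across() (which assumes dplyr 1.0 or later) are mine, not the answerer's. Since the four shares in each row add up to about 1, the stacked height of a bar is roughly the number of rows in that bin, which is why the y axis is labelled as a count here.

```r
library(ggplot2)
library(dplyr)
library(tidyr)

# Sample rows copied from the question.
dat_q <- read.table(text = "Name Total A B C D
1 0.1396104 0.029220779 0.009740260 0.029220779 0.07142857
2 0.1250000 0.010869565 0.021739130 0.016304348 0.07608696
3 0.1337580 0.006369427 0.000000000 0.025477707 0.10191083
4 0.1239669 0.016528926 0.024793388 0.033057851 0.04958678
5 0.1242938 0.011299435 0.016949153 0.039548023 0.05649718
6 0.1311475 0.000000000 0.000000000 0.021857923 0.10928962
7 0.1376147 0.004587156 0.004587156 0.004587156 0.12385321
8 0.1574074 0.046296296 0.018518519 0.032407407 0.06018519
9 0.1269036 0.010152284 0.010152284 0.020304569 0.08629442", header = TRUE)

# breaks = 5 is an arbitrary choice for nine rows, not a value from the thread.
his_q <- hist(dat_q$Total, breaks = 5, plot = FALSE)
dat_q$bin <- cut(dat_q$Total, breaks = his_q$breaks,
                 labels = his_q$mids, include.lowest = TRUE)

long_q <- dat_q %>%
  mutate(across(A:D, ~ .x / Total)) %>%        # share of each component per row
  pivot_longer(A:D) %>%                        # columns 'name' and 'value'
  mutate(bin = as.numeric(as.character(bin)))  # factor label back to a number

p_q <- ggplot(long_q, aes(x = bin, y = value, fill = name)) +
  geom_col() +
  labs(x = "enrichment", y = "count")
print(p_q)
```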
Another way to visualize your data:
dat$interval = cut_interval(dat$Total,5)
dat %>% mutate_at(c("A","B","C","D"),~.x/Total) %>%
group_by(interval) %>% select(c(interval,A:D)) %>%
summarize_all(mean) %>% pivot_longer(-interval) %>%
ggplot(aes(x=interval,y=value,fill=name)) + geom_col()
This shows, for every range of Total, the average proportion that each of A, B, C and D contributes to it.
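If you want to look at the numbers behind that second plot, the summary table can be computed on its own. The sketch below assumes dplyr 1.0 or later and uses across() in place of mutate_at() and summarize_all(), which are superseded there; mean_share is a name I made up for the result.

```r
library(dplyr)

# Rebuild the simulated data from the answer so this block runs on its own.
set.seed(321)
dat <- data.frame(Name = 1:500, matrix(runif(500 * 4), ncol = 4))
colnames(dat)[-1] <- LETTERS[1:4]
dat$Total <- rowSums(dat[, -1])
dat$interval <- ggplot2::cut_interval(dat$Total, 5)

# Mean share of A, B, C and D within each interval of Total.
mean_share <- dat %>%
  mutate(across(A:D, ~ .x / Total)) %>%
  group_by(interval) %>%
  summarise(across(A:D, mean))

print(mean_share)
```

Each row of mean_share should add up to 1 across A to D, because the shares add up to 1 within every original row.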
Upvotes: 2