Ra-v

Reputation: 1

How to make a density histogram split by a second variable in ggplot2?

I have a problem with my density histogram in ggplot2. I am working in RStudio and trying to create a density histogram of income, broken down by a person's occupation. The problem is that when I use this code:

data = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
        sep=",",header=F,col.names=c("age", "type_employer", "fnlwgt", "education", 
                "education_num","marital", "occupation", "relationship", "race","sex",
                "capital_gain", "capital_loss", "hr_per_week","country", "income"),
        fill=FALSE,strip.white=T)

ggplot(data=data, aes(x=income)) + 
  geom_histogram(stat='count', 
                 aes(x= income, y=stat(count)/sum(stat(count)), 
                     col=occupation, fill=occupation),
                 position='dodge')

I get a histogram in which each bar is the count divided by the overall count across all categories. What I want instead is, for example, the number of people earning >50K whose occupation is 'Craft-repair' divided by the total number of people whose occupation is Craft-repair, the same for <=50K within that occupation, and so on for every other occupation.

My second question: once I have a proper density histogram, how can I sort the bars in decreasing order?

Upvotes: 0

Views: 635

Answers (1)

Mako212

Reputation: 7312

This is a situation where it makes sense to re-aggregate your data first, before plotting. Aggregating within the ggplot call works fine for simple aggregations, but when you need to aggregate and then peel off a group for a second calculation, it doesn't work so well. Also note that because your x axis is discrete, a histogram isn't the right tool here; we'll use geom_bar() instead.

First we count the rows for each income/occupation pair, then divide each count by the total for its occupation. Note that the resulting percent column is a proportion between 0 and 1.

library(dplyr)

# Count rows per income/occupation pair, then divide by each occupation's total
d2 <- data %>% group_by(income, occupation) %>% 
  summarize(count = n()) %>% 
  group_by(occupation) %>% 
  mutate(percent = count / sum(count))
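
As a quick cross-check of the within-occupation normalization, the same shares can be computed in base R with table() and prop.table(). This is only a sketch: the toy data frame and its values are made up for illustration and are not taken from the adult data set.

```r
# Toy data standing in for the adult data set (values are made up)
toy <- data.frame(
  occupation = c("Craft-repair", "Craft-repair", "Craft-repair", "Craft-repair",
                 "Sales", "Sales"),
  income     = c("<=50K", "<=50K", "<=50K", ">50K", "<=50K", ">50K")
)

# Cross-tabulate occupation (rows) against income (columns)
counts <- table(toy$occupation, toy$income)

# margin = 1 divides every cell by its row total, so each occupation sums to 1
shares <- prop.table(counts, margin = 1)

print(shares)
```

Each row of shares should add up to 1, which is the property the grouped mutate() relies on: the denominator is the occupation total, not the grand total.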

Then simply plot a bar chart using geom_bar and position = 'dodge' so the bars are side by side, rather than stacked.

library(ggplot2)

d2 %>% ggplot(aes(income, percent, fill = occupation)) + 
  geom_bar(stat = 'identity', position = 'dodge')
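
For the second question (sorting the bars), one possible approach is to reorder the levels of the occupation factor before plotting, since dodged bars are drawn in factor-level order. The sketch below assumes you want to rank occupations by their >50K share; d2_toy and its numbers are hypothetical stand-ins for d2, not real results.

```r
library(ggplot2)

# Toy stand-in for d2 (made-up shares; each occupation's two rows sum to 1)
d2_toy <- data.frame(
  occupation = rep(c("Craft-repair", "Exec-managerial", "Sales"), each = 2),
  income     = rep(c("<=50K", ">50K"), times = 3),
  percent    = c(0.77, 0.23, 0.52, 0.48, 0.73, 0.27)
)

# Ranking key (an assumption): the >50K share of each occupation
high <- d2_toy[d2_toy$income == ">50K", ]

# Set the factor levels so the largest >50K share comes first
d2_toy$occupation <- factor(
  d2_toy$occupation,
  levels = high$occupation[order(high$percent, decreasing = TRUE)]
)

# Same plot as before; bars within each income group now follow the new level order
p <- ggplot(d2_toy, aes(income, percent, fill = occupation)) +
  geom_bar(stat = "identity", position = "dodge")
print(p)
```

Because the ranking uses the >50K share, the bars in the >50K group appear in decreasing order, while the <=50K group appears in increasing order. Ranking by a different key only requires changing how high is built.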


Upvotes: 2
