Michael

Reputation: 3

R - ggplot2 - limit bar chart output for categorical data

I am trying to create a bar chart in ggplot2 that limits the x-axis to the 10% most frequent levels of a categorical variable.

My dataframe is a dataset that contains statistics on personal loans. I am examining the relationship between two categories, Loan Status and Occupation.

First, I want to limit Loan Status to loans that have been "charged off." Next, I want to plot how many loans have been charged off across various occupations using a bar chart. There are 67 unique values for Occupation, and I want to limit the plot to only the most frequent occupations (a cutoff by count or by percentage is fine, e.g. "7" or "10%").

In the code below, I am using the forcats function fct_infreq to order the bar chart by frequency in descending order. However, I cannot find a function to limit the number of x-axis categories. I have experimented with quantile, scale_x_discrete, etc. but those don't seem to work for categorical data.

Thanks for your help!

df %>% filter(LoanStatus %in% c("Chargedoff")) %>% 
ggplot() +
  geom_bar(aes(fct_infreq(Occupation)), stat = 'count') +
  scale_x_discrete(limits = c(quantile(df$Occupation, 0.9), quantile(df$Occupation, 1)))

Resulting error:

Error in (1 - h) * qs[i] : non-numeric argument to binary operator

UPDATE: Using Yifu's answer below, I was able to get the desired output like this:

pd_occupation <- pd %>% 
  dplyr::filter(LoanStatus == "Chargedoff") %>%
  group_by(Occupation) %>% 
  mutate(group_num = n())

table(pd_occupation$group_num)  # to view the distribution

ggplot(subset(pd_occupation, group_num >= 361)) +
  geom_bar(aes(fct_infreq(Occupation)), stat = 'count') +
  ggtitle('Loan Charge-Offs by Occupation')

Upvotes: 0

Views: 1870

Answers (1)

Yifu Yan

Reputation: 6106

You can do it in dplyr instead:

library(dplyr)

# only use cars whose carb value appears at least 7 times to create a plot
mtcars %>%
    group_by(carb) %>%
    mutate(group_num = n()) %>%
    # you can replace 7 with a percentile-based cutoff or whatever threshold you want
    dplyr::filter(group_num >= 7) #%>%
    #ggplot()
    #create your plot

The idea is to filter the observations first and pass the result to ggplot, rather than trying to filter the data inside ggplot.
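
If you would rather keep a fixed number of categories than pick a count threshold by hand, the same filter-first approach can be sketched roughly as follows. This is only an illustration on mtcars; the names n_keep, top_carbs and plot_data are made up for the example, and it assumes dplyr, forcats and ggplot2 are installed.

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Assumption for illustration: show only the 3 most frequent carb values.
# For a percentage instead, something like
#   n_keep <- ceiling(0.10 * n_distinct(mtcars$carb))
# would give the number of categories in the top 10%.
n_keep <- 3

# One row per carb value, most frequent first; keep the first n_keep values.
# Ties at the cutoff are resolved by whatever order count() returns.
top_carbs <- mtcars %>%
  count(carb, sort = TRUE) %>%
  head(n_keep) %>%
  pull(carb)

# Filter the observations before they reach ggplot
plot_data <- mtcars %>%
  dplyr::filter(carb %in% top_carbs)

# carb is numeric, so convert it to a factor before ordering by frequency
ggplot(plot_data) +
  geom_bar(aes(fct_infreq(factor(carb)))) +
  ggtitle("Cars by carb (most frequent values only)")
```

For the loan data in the question, the equivalent would presumably be to filter on LoanStatus first and then count Occupation instead of carb.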

Upvotes: 1
