Reputation: 3
I am trying to create a bar chart in ggplot2 that limits the x-axis to the most frequent levels of a categorical variable (for example, the top 10%).
My dataframe is a dataset that contains statistics on personal loans. I am examining the relationship between two categories, Loan Status and Occupation.
First, I want to limit Loan Status to loans that have been "charged off." Next, I want to plot how many loans have been charged off across various occupations using a bar chart. There are 67 unique values for Occupation - I want to limit the plot to only the most frequent occupations (by integer or percentage, i.e. "7" or "10%" works).
In the code below, I am using the forcats function fct_infreq to order the bar chart by frequency in descending order. However, I cannot find a function to limit the number of x-axis categories. I have experimented with quantile, scale_x_discrete, etc., but those don't seem to work for categorical data.
Thanks for your help!
df %>% filter(LoanStatus %in% c("Chargedoff")) %>%
ggplot() +
geom_bar(aes(fct_infreq(Occupation)), stat = 'count') +
scale_x_discrete(limits = c(quantile(df$Occupation, 0.9), quantile(df$Occupation, 1)))
Resulting error:
Error in (1 - h) * qs[i] : non-numeric argument to binary operator
UPDATE: Using Yifu's answer below, I was able to get the desired output like this:
pd_occupation <- pd %>%
dplyr::filter(LoanStatus == "Chargedoff") %>%
group_by(Occupation) %>%
mutate(group_num = n())
table(pd_occupation$group_num)  # view the distribution of group sizes
ggplot(subset(pd_occupation, group_num >= 361)) +
geom_bar(aes(fct_infreq(Occupation)), stat = 'count') +
ggtitle('Loan Charge-Offs by Occupation')
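The threshold of 361 above was read off the table by hand. A top-N variant that avoids a manual cutoff could look like the sketch below. It assumes dplyr 1.0 or later is installed. The data frame loans is a made-up stand-in for the real loan data, and the names top_n_occ, keep and loans_top are purely illustrative.

```r
library(dplyr)

# Toy stand-in for the loan data (hypothetical values, not the real dataset)
loans <- data.frame(
  LoanStatus = "Chargedoff",
  Occupation = rep(c("Other", "Professional", "Clerical", "Teacher"),
                   times = c(5, 4, 2, 1))
)

top_n_occ <- 2  # number of occupations to keep (assumed; e.g. 7 for the real data)

# Names of the most frequent occupations among charged-off loans
keep <- loans %>%
  dplyr::filter(LoanStatus == "Chargedoff") %>%
  count(Occupation, sort = TRUE) %>%
  slice_head(n = top_n_occ) %>%
  pull(Occupation)

# Rows restricted to those occupations, ready for ggplot() + geom_bar()
loans_top <- loans %>%
  dplyr::filter(LoanStatus == "Chargedoff", Occupation %in% keep)
```

Passing loans_top to ggplot() with fct_infreq(Occupation), as in the update, should then show only the selected occupations. Ties at the cutoff are broken by row order here, so a different tie rule may be preferable for real data.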
Upvotes: 0
Views: 1870
Reputation: 6106
You can do it in dplyr instead:
# only use cars whose carb value appears at least 7 times to create the plot
mtcars %>%
  group_by(carb) %>%
  mutate(group_num = n()) %>%
  # you can replace 7 with a 10% percentile cutoff or whatever threshold you want
  dplyr::filter(group_num >= 7) #%>%
  #ggplot()
  #create your plot
The idea is to filter the observations first and pass the result to ggplot, rather than filtering the data inside ggplot.
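As a rough sketch of the percentile variant mentioned in the code comment, the cutoff can be computed from the per-group counts with quantile() rather than hard-coded. The counts are numeric, so quantile() works on them even though it fails on the categorical column itself. This assumes dplyr is installed. The names counts, cutoff and top_groups are illustrative only, and mtcars again stands in for the loan data.

```r
library(dplyr)

# One row per category with its frequency
counts <- mtcars %>% count(carb, name = "group_num")

# 90th percentile of the group sizes, returned as a plain number
cutoff <- quantile(counts$group_num, 0.9, names = FALSE)

# Keep only rows whose category size is at or above the cutoff
top_groups <- mtcars %>%
  group_by(carb) %>%
  mutate(group_num = n()) %>%
  ungroup() %>%
  dplyr::filter(group_num >= cutoff)

# top_groups can then be passed on to ggplot() as before
```

With tied group sizes at the cutoff, this can keep more than 10% of the categories, so it is worth checking the result with table() before plotting.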
Upvotes: 1