MikeTP
MikeTP

Reputation: 7986

How to use ggplot to group and show top X categories?

I am trying to use use ggplot to plot production data by company and use the color of the point to designate year. The follwoing chart shows a example based on sample data: enter image description here

However, often times my real data has 50-60 different comapnies wich makes the Company names on the Y axis to be tiglhtly grouped and not very asteticly pleaseing.

What is th easiest way to show data for only the top 5 companies information (ranked by 2011 quanties) and then show the rest aggregated and shown as "Other"?

Below is some sample data and the code I have used to create the sample chart:

# create some sample data
c=c("AAA","BBB","CCC","DDD","EEE","FFF","GGG","HHH","III","JJJ")

q=c(1,2,3,4,5,6,7,8,9,10)
y=c(2010)
df1=data.frame(Company=c, Quantity=q, Year=y)

q=c(3,4,7,8,5,14,7,13,2,1)
y=c(2011)
df2=data.frame(Company=c, Quantity=q, Year=y)

df=rbind(df1, df2)

# create plot
p=ggplot(data=df,aes(Quantity,Company))+
  geom_point(aes(color=factor(Year)),size=4)
p

I started down the path of a brute force approach but thought there is probably a simple and elegent way to do this that I should learn. Any assistance would be greatly appreciated.

Upvotes: 4

Views: 15789

Answers (2)

Sandy Muspratt
Sandy Muspratt

Reputation: 32859

See if this is what you want. It takes your df dataframe, and some of the ideas already suggested by @cbeleites. The steps are:

1.Select 2011 data and order the companies from highest to lowest on Quantity.

2.Split df into two bits: dftop which contians the data for the top 5; and dfother, which contains the aggregated data for the other companies (using ddply() from the plyr package).

3.Put the two dataframes together to give dfnew.

4.Set the order for which levels of Company are plotted: Top to bottom is highest to lowest, then "Other". The order is partly given by companies, plus "Other".

5.Plot as before.

library(ggplot2)
library(plyr)

# Step 1
df2011 <- subset (df, Year == 2011)
companies <- df2011$Company [order (df2011$Quantity, decreasing = TRUE)]

# Step 2
dftop = subset(df, Company %in% companies [1:5])
dftop$Company = droplevels(dftop$Company)

dfother = ddply(subset(df, !(Company %in% companies [1:5])), .(Year), summarise, Quantity = sum(Quantity))
dfother$Company = "Other"

# Step 3
dfnew = rbind(dftop, dfother)

# Step 4
dfnew$Company = factor(dfnew$Company, levels = c("Other", rev(as.character(companies)[1:5])))
levels(dfnew$Company)    # Check that the levels are in the correct order

# Step 5
p = ggplot (data = dfnew, aes (Quantity, Company)) +
        geom_point (aes (color = factor (Year)), size = 4)
p

The code produces:

enter image description here

Upvotes: 3

cbeleites
cbeleites

Reputation: 14093

What about this:

    df2011 <- subset (df, Year == 2011)
    companies <- df2011$Company [order (df2011$Quantity, decreasing = TRUE)]
    ggplot (data = subset (df, Company %in% companies [1 : 5]), 
            aes (Quantity, Company)) +
            geom_point (aes (color = factor (Year)), size = 4)

BTW: in order for the code to be called elegant, spend a few more spaces, they aren't that expensive...

Upvotes: 6

Related Questions