Reputation: 4453
I am using RStudio to do my R coding. I have a data set (called mydata2) and I am using this dataframe to build a plot in ggplot2.
library(ggplot2)
mydata = read.csv("extrasjan15feb17.csv")
mydata2 = mydata[(mydata$PropertyCode == "PLN" & mydata$Year == 2016), ]
options(scipen=99)
ggplot(mydata2,aes(Year, TotalSpending)) + geom_jitter(size=2,alpha=0.5)+
scale_y_continuous(breaks=number_ticks(20),
limits = c(min=0,max=254000))+
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank())
The above code gives me the following plot:
Basically, the graph plots all the values in the 'TotalSpending' column of the 'mydata2' dataframe.
Now, my challenge is that I want the top 20 percent of these values to appear in a different color in the plot. How do I tackle this challenge?
I was thinking about creating a new column in the dataframe that holds 'Top 20 Percent' or 'Other' for each row, and then using that new column as the basis for 'Color' in my ggplot2 code. However, I have no clue how to do it. Or maybe I am completely on the wrong track and there is another way to achieve this.
Any help would be highly appreciated.
Upvotes: 1
Views: 726
Reputation: 13817
library(ggplot2)
# get some sample data
data("mtcars")
# create a logical flag: TRUE for values above the 80th percentile, i.e. the top 20%
mtcars$percentile20 <- mtcars$qsec > quantile(mtcars$qsec, 0.8)
# plot: FALSE is drawn in black, TRUE in red
ggplot() +
geom_point(data=mtcars, aes(hp, qsec, color=percentile20)) +
scale_color_manual(values = c("black", "red"))
As mentioned by @Steven in the comment, if you don't want to create a new column, you can compute the condition directly inside aes() and the result will be the same:
ggplot() +
geom_point(data=mtcars, aes(hp, qsec, color=qsec > quantile(qsec, probs=0.8))) +
scale_color_manual(values = c("black", "red"))
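If you would rather have the legend read 'Top 20 Percent' and 'Other', as the question suggests, here is a minimal sketch of one way to do it. The column name `spending_group` is my own placeholder, and mtcars stands in for the real data. It assumes "top 20 percent" means values above the 80th percentile.

```r
library(ggplot2)

data("mtcars")

# Values above the 80th percentile are treated as the top 20 percent
cutoff <- quantile(mtcars$qsec, probs = 0.8)

# spending_group is a hypothetical column name, not one from the question
mtcars$spending_group <- factor(
  ifelse(mtcars$qsec > cutoff, "Top 20 Percent", "Other"),
  levels = c("Other", "Top 20 Percent")
)

# Name the colors explicitly so they do not depend on the level order
p <- ggplot(mtcars, aes(hp, qsec, color = spending_group)) +
  geom_point() +
  scale_color_manual(values = c("Other" = "black", "Top 20 Percent" = "red"))
print(p)
```

For the data in the question, the same idea should carry over by swapping in mydata2 and TotalSpending.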
Upvotes: 0
Reputation: 8317
You can mutate a new column using dplyr to indicate whether or not a given row is in the top 20%. You can then color your data points based on the value of that column.
library(tidyverse) # Contains ggplot2 and so much more
# I don't have access to the CSV so here's some random data
mydata2 = tibble(TotalSpending = abs(rnorm(500)), Year = runif(500, min = 1900, max = 2000))
# I assume you're using this function from another StackOverflow answer?
number_ticks <- function(n) {function(limits) pretty(limits, n)}
# Create a new variable indicating whether or not a given value is in the top 20%
mydata2 <- mydata2 %>%
mutate(top20 = percent_rank(TotalSpending) > 0.8) # TRUE only for the highest 20% of values
# Specify color = top20 in aes()
options(scipen=99)
ggplot(mydata2,aes(Year, TotalSpending, color = top20)) +
geom_jitter(size=2,alpha=0.5)+
scale_y_continuous(breaks=number_ticks(20),
limits = c(0, NA))+ # lower limit 0, upper limit taken from the data
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank())
I'm not familiar with the function number_ticks. I found it defined in another StackOverflow question, so I copied that function definition into my answer.
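As a quick sanity check on whichever cutoff you use, you can look at the share of rows that end up flagged. A rough sketch on random data follows. The seed and the column names `top20_rank` and `top20_quantile` are arbitrary choices of mine. If the flag really marks the top 20%, both shares should come out close to 0.2.

```r
library(dplyr)

set.seed(1)  # arbitrary seed, only so the random data is reproducible
spend <- tibble(TotalSpending = abs(rnorm(500)))

flagged <- spend %>%
  mutate(
    # rank-based flag: percent_rank() rescales ranks to the range [0, 1]
    top20_rank = percent_rank(TotalSpending) > 0.8,
    # quantile-based flag: compare each value with the 80th percentile
    top20_quantile = TotalSpending > quantile(TotalSpending, probs = 0.8)
  )

# Share of rows flagged by each approach
shares <- flagged %>%
  summarise(share_rank = mean(top20_rank),
            share_quantile = mean(top20_quantile))
print(shares)
```

A share near 0.8 instead would mean the comparison is pointing at the wrong end of the distribution.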
Upvotes: 1