Reputation: 828
I'm working with the Kickstarter Dataset from Kaggle and I would like to create meaningful visualization with ggplot
about how display the project data about pledge ratios (this is a field I added, which is calculated by dividing the USD Pledged amount by the USD Goal amount, per project.
To replicate the dataset I'm using in R, please use the following code:
if(!require(tidyverse)) install.packages("tidyverse", repos = "http://cran.us.r-project.org")
if(!require(ggplot2)) install.packages("ggplot2", repos = "http://cran.us.r-project.org")
if(!require(dplyr)) install.packages("dplyr", repos = "http://cran.us.r-project.org")
library(tidyverse)
library(ggplot2)
library(dplyr)
file_path <- "https://raw.githubusercontent.com/mganopolsky/kickstarter/master/data/ks-projects-201801.csv"
data <-read_csv(file_path)
ds <- data %>% dplyr::select(-`usd pledged`)
ds <- ds %>% mutate(time_int = as.numeric(deadline - as.Date(launched)) ,
launched = as.Date(launched),
pledged_ratio = round(usd_pledged_real / usd_goal_real, 2),
avg_backer_pldg = ifelse(backers == 0, 0, round(usd_pledged_real/backers) )) %>%
mutate(launched_month = as.factor(format(launched, "%m")),
launched_day_of_week = as.factor(format(launched, "%u") ),
currency = as.factor(currency),
launched_year = as.factor(format(launched, "%Y")))
ds <- ds %>% filter(launched >= "2009-04-21")
At this point, I would like a visualization of what kind of pledge_ratio we can see across projects. This data can be viewed with the following code:
ds %>% filter(state=="successful" ) %>% group_by(pledged_ratio) %>% summarise( pledged_ratio_count = n()) %>%
arrange(desc(pledged_ratio))
This gives an idea of how many projects fall into a specific ratio - however, this number isn't really meaningful. A binned display of some sort would be much more preferable - for instance, using a geom_histogram()
, or even a geom_density()
.
When I run the density plot, the result looks like this:
ds %>% filter(state=="successful" ) %>%
arrange(desc(pledged_ratio)) %>% ggplot(aes(pledged_ratio)) + geom_density() +
ggtitle("Density Distribution of Pledge Ratios for Succeessful Projects") + xlab("Pledge Ratios")
This makes sense once you stare at it for a while, because most of the projects get funded at around 100%, or a ratio of 1. However, there are some that get funded at much higher rates, and I want a visualization that will show that in a way that's not meaningless.
I have tried this with histograms:
ds %>% filter(state=="successful" ) %>%
arrange(desc(pledged_ratio)) %>% ggplot(aes(pledged_ratio)) + geom_histogram(bins = 20)
and this produced another somewhat meaningless histogram:
Finally, using geom_point() I got this:
ds %>% filter(state=="successful" ) %>% group_by(pledged_ratio) %>% summarise( pledged_ratio_count = n()) %>%
arrange(desc(pledged_ratio)) %>% ggplot(aes(pledged_ratio, y=pledged_ratio_count)) + geom_point()
And that resulted in this, may be the most insightful graph so far. :
However, I am still convinced that there's got to be a better way to convey what the data is telling. Any advice would be greatly appriciated.
Upvotes: 0
Views: 89
Reputation: 21937
What about an empirical CDF?
library(scales)
ds %>% filter(state=="successful") %>%
ggplot(aes(x=pledged_ratio)) +
stat_ecdf() +
scale_x_continuous(trans="pseudo_log", breaks = c(10, 100, 1000, 10000, 100000), labels=comma) +
scale_y_continuous(labels=percent) +
theme_bw() +
labs(x="Pledged Ratio", y="Percentage of Projects")
Upvotes: 1