pikovayadama
pikovayadama

Reputation: 828

Data Visualization Clarification in R for a density / histogram plot

I'm working with the Kickstarter Dataset from Kaggle and I would like to create meaningful visualization with ggplot about how display the project data about pledge ratios (this is a field I added, which is calculated by dividing the USD Pledged amount by the USD Goal amount, per project.

To replicate the dataset I'm using in R, please use the following code:

if(!require(tidyverse)) install.packages("tidyverse", repos = "http://cran.us.r-project.org")
if(!require(ggplot2)) install.packages("ggplot2", repos = "http://cran.us.r-project.org")
if(!require(dplyr)) install.packages("dplyr", repos = "http://cran.us.r-project.org")

library(tidyverse)
library(ggplot2)
library(dplyr)

file_path <- "https://raw.githubusercontent.com/mganopolsky/kickstarter/master/data/ks-projects-201801.csv"
data  <-read_csv(file_path)


ds <- data %>% dplyr::select(-`usd pledged`)

ds <- ds %>% mutate(time_int = as.numeric(deadline - as.Date(launched)) ,
                    launched = as.Date(launched),
                    pledged_ratio = round(usd_pledged_real / usd_goal_real, 2),
                    avg_backer_pldg = ifelse(backers == 0, 0, round(usd_pledged_real/backers) )) %>%
  mutate(launched_month = as.factor(format(launched, "%m")),
         launched_day_of_week = as.factor(format(launched, "%u")  ),
         currency = as.factor(currency),
         launched_year = as.factor(format(launched, "%Y")))


ds <- ds %>% filter(launched >= "2009-04-21")

At this point, I would like a visualization of what kind of pledge_ratio we can see across projects. This data can be viewed with the following code:

ds %>% filter(state=="successful" ) %>% group_by(pledged_ratio) %>% summarise( pledged_ratio_count = n()) %>%
  arrange(desc(pledged_ratio)) 

This gives an idea of how many projects fall into a specific ratio - however, this number isn't really meaningful. A binned display of some sort would be much more preferable - for instance, using a geom_histogram(), or even a geom_density().

When I run the density plot, the result looks like this:

ds %>% filter(state=="successful" ) %>% 
  arrange(desc(pledged_ratio))  %>% ggplot(aes(pledged_ratio)) + geom_density() + 
  ggtitle("Density Distribution of Pledge Ratios for Succeessful Projects") + xlab("Pledge Ratios") 

enter image description here

This makes sense once you stare at it for a while, because most of the projects get funded at around 100%, or a ratio of 1. However, there are some that get funded at much higher rates, and I want a visualization that will show that in a way that's not meaningless.

I have tried this with histograms:

ds %>% filter(state=="successful" ) %>% 
  arrange(desc(pledged_ratio))  %>% ggplot(aes(pledged_ratio)) + geom_histogram(bins = 20)

and this produced another somewhat meaningless histogram:

enter image description here

Finally, using geom_point() I got this:

ds %>% filter(state=="successful" ) %>% group_by(pledged_ratio) %>% summarise( pledged_ratio_count = n()) %>%
  arrange(desc(pledged_ratio))  %>% ggplot(aes(pledged_ratio, y=pledged_ratio_count)) + geom_point()

And that resulted in this, may be the most insightful graph so far. :

enter image description here

However, I am still convinced that there's got to be a better way to convey what the data is telling. Any advice would be greatly appriciated.

Upvotes: 0

Views: 89

Answers (1)

DaveArmstrong
DaveArmstrong

Reputation: 21937

What about an empirical CDF?

library(scales)
ds %>% filter(state=="successful") %>% 
  ggplot(aes(x=pledged_ratio)) + 
  stat_ecdf() + 
  scale_x_continuous(trans="pseudo_log", breaks = c(10, 100, 1000, 10000, 100000), labels=comma) + 
  scale_y_continuous(labels=percent) + 
  theme_bw() + 
  labs(x="Pledged Ratio", y="Percentage of Projects")

enter image description here

Upvotes: 1

Related Questions