erickfis

Reputation: 1204

ggplot2: comparing 2 groups by the fraction of their members

Let's say we have 10,000 users classified into 2 groups: lvl beginner and lvl pro.

Every user has a rank, going from 1 to 20.

The df:

# beginners
n <- 7000
user.id <- 1:n
lvl <- "beginer"
rank <- sample(1:20, n, replace = TRUE,
               prob = seq(.9,0.1,length.out = 20))
df.beginer <- data.frame(user.id, rank, lvl)

# pros
n <- 3000
user.id <- 1:n
lvl <- "pro"
rank <- sample(1:20, n, replace = TRUE,
               prob = seq(.9,0.3,length.out = 20))
df.pro <- data.frame(user.id, rank, lvl)

library(dplyr)
df <- bind_rows(df.beginer, df.pro)
df2 <- as_tibble(df) %>% group_by(lvl, rank) %>% mutate(count = n())

Problem 1: I need a bar plot comparing the groups side by side, but instead of counts I need percentages, so that the bars of each group are on the same scale (up to 100%).

The plot I got so far:

library(ggplot2)
plot <- ggplot(df2, aes(rank))
plot + geom_bar(aes(fill=lvl),  position="dodge")

[image: barplot]

Problem 2:

I need a line plot comparing the groups, so there will be 2 lines, but instead of counts I need percentages, so that the lines of each group are on the same scale (up to 100%).

The plot I got so far:

plot + geom_line(aes(y=count, color=lvl))

[image: lines]

Problem 3:

Let's say that the ranks are cumulative, so a user who has rank 3 also has ranks 1 and 2. A user who has rank 20 has all ranks from 1 to 20.

So, when plotting, I want the plot to start at rank 1 with 100% of the users in each group; rank 2 will have somewhat less, rank 3 even less, and so on.

I got all of this done in Tableau, but I really dislike it and want to show myself that R can handle all this stuff.

Thank you!

Upvotes: 0

Views: 1231

Answers (1)

bouncyball

Reputation: 10761

Three problems, three solutions:

Problem 1: calculate the percentage within each lvl and use geom_col

df %>%
  group_by(rank, lvl)%>%
  summarise(count = n()) %>%
  group_by(lvl) %>%
  mutate(count_perc = count / sum(count)) %>% # calculate percentage
  ggplot(., aes(x = rank, y = count_perc))+
  geom_col(aes(fill = lvl), position = 'dodge')
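
The key step is the second group_by(lvl): it makes sum(count) a per-group total, so each group's fractions add up to 1 regardless of group size. A minimal sketch to check this on a tiny data frame (the `toy` data and the `toy_perc` name are made up for illustration; `count()` is used here as shorthand for group_by + summarise(count = n())):

```r
library(dplyr)

# Hypothetical toy data: 4 "beginer" users and 2 "pro" users
toy <- data.frame(
  lvl  = c("beginer", "beginer", "beginer", "beginer", "pro", "pro"),
  rank = c(1, 1, 2, 3, 1, 2)
)

toy_perc <- toy %>%
  count(lvl, rank, name = "count") %>%        # one row per lvl/rank combination
  group_by(lvl) %>%
  mutate(count_perc = count / sum(count)) %>% # fraction within each lvl
  ungroup()

toy_perc
```

If the y axis should be labelled as percentages rather than fractions, adding scale_y_continuous(labels = scales::percent) to the plot should do it, assuming the scales package is installed.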


Problem 2: pretty much the same as problem 1, except use geom_line instead of geom_col

df %>%
  group_by(rank, lvl)%>%
  summarise(count = n()) %>%
  group_by(lvl) %>%
  mutate(count_perc = count / sum(count)) %>%
  ggplot(., aes(x = rank, y = count_perc))+
  geom_line(aes(colour = lvl))


Problem 3: sort by descending rank with arrange and accumulate the counts with cumsum

df %>%
  group_by(lvl, rank) %>%
  summarise(count = n()) %>% # count by level and rank
  group_by(lvl) %>%
  arrange(desc(rank)) %>% # sort descending
  mutate(cumulative_count = cumsum(count)) %>% # use cumsum
  mutate(cumulative_count_perc = cumulative_count / max(cumulative_count)) %>%
  ggplot(., aes(x = rank, y = cumulative_count_perc))+
  geom_line(aes(colour = lvl))
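
Sorting from the highest rank down means the running total at rank r counts every user whose rank is r or higher, which is the "has this rank" reading from the question. A small worked sketch on made-up data (`toy` and `toy_cum` are hypothetical names; `.by_group = TRUE` is used here only to keep each lvl's rows together, and dividing by sum(count) is equivalent to dividing by max(cumulative_count)):

```r
library(dplyr)

# Hypothetical toy data: 4 "beginer" users and 2 "pro" users
toy <- data.frame(
  lvl  = c(rep("beginer", 4), rep("pro", 2)),
  rank = c(1, 1, 2, 3, 1, 2)
)

toy_cum <- toy %>%
  count(lvl, rank, name = "count") %>%
  group_by(lvl) %>%
  arrange(desc(rank), .by_group = TRUE) %>%  # highest rank first within each lvl
  mutate(
    cumulative_count      = cumsum(count),   # users with this rank or higher
    cumulative_count_perc = cumulative_count / sum(count)
  ) %>%
  ungroup()

toy_cum
```

In this toy case every group reaches 1 (100%) at rank 1, and the beginer group drops to 0.5 at rank 2 and 0.25 at rank 3.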


Upvotes: 4
