Combined frequency histogram using two attributes

Question

I'm using ggplot2 to create histograms for two different parameters. My current approach is attached at the end of my question (including a dataset, which can be used and loaded right from pasetbin.com), which creates

a histrogram visualizing the frequency for the spatial distribution of logged user data based on the "location"-attribute (either "WITHIN" or "NOT_WITHIN").
a histogram visualizing the frequency for the distribution of logged user data based on the "context"-attribute (either "Clicked A" or "Clicked B").

This looks like the follwoing:

# Load my example dataset from pastebin
RawDataSet <- read.csv("http://pastebin.com/raw/uKybDy03", sep=";")
# Load packages
library(plyr)
library(dplyr)
library(reshape2)
library(ggplot2)

###### Create Frequency Table for Location-Information
LocationFrequency <- ddply(RawDataSet, .(UserEmail), summarize, 
                           All = length(UserEmail),
                           Within_area = sum(location=="WITHIN"),
                           Not_within_area = sum(location=="NOT_WITHIN"))
# Create a column for unique identifiers
LocationFrequency <- mutate(LocationFrequency, id = rownames(LocationFrequency))
# Reorder columns
LocationFrequency <- LocationFrequency[,c(5,1:4)]
# Format id-column as numbers (not as string)
LocationFrequency[,c(1)] <- sapply(LocationFrequency[, c(1)], as.numeric)
# Melt data
LocationFrequency.m = melt(LocationFrequency, id.var=c("UserEmail","All","id"))
# Plot data
p <- ggplot(LocationFrequency.m, aes(x=id, y=value, fill=variable)) +
  geom_bar(stat="identity") +
  theme_grey(base_size = 16)+
  labs(title="Histogram showing the distribution of all spatial information per user.") + 
  labs(x="User", y="Number of notifications interaction within/not within the area") +
  # using IDs instead of UserEmail
  scale_x_continuous(breaks=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30), labels=c("1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","23","24","25","26","27","28","29","30"))
# Change legend Title
p + labs(fill = "Type of location")



##### Create Frequency Table for Interaction-Information
InterationFrequency <- ddply(RawDataSet, .(UserEmail), summarize, 
                             All = length(UserEmail),
                             Clicked_A = sum(context=="Clicked A"),
                             Clicked_B = sum(context=="Clicked B"))
# Create a column for unique identifiers
InterationFrequency <- mutate(InterationFrequency, id = rownames(InterationFrequency))
# Reorder columns
InterationFrequency <- InterationFrequency[,c(5,1:4)]
# Format id-column as numbers (not as string)
InterationFrequency[,c(1)] <- sapply(InterationFrequency[, c(1)], as.numeric)
# Melt data
InterationFrequency.m = melt(InterationFrequency, id.var=c("UserEmail","All","id"))
# Plot data
p <- ggplot(InterationFrequency.m, aes(x=id, y=value, fill=variable)) +
  geom_bar(stat="identity") +
  theme_grey(base_size = 16)+
  labs(title="Histogram showing the distribution of all interaction types per user.") + 
  labs(x="User", y="Number of interaction") +
  # using IDs instead of UserEmail 
  scale_x_continuous(breaks=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30), labels=c("1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","23","24","25","26","27","28","29","30"))
  # Change legend Title
  p + labs(fill = "Type of interaction")

But what I'm trying to realize: How can I combine both histograms in only one plot? Would it be somehow possible to place the corressponding percentage for each part? Somethink like the following sketch, which represents the total number of observations per user (the complete height of the bar) and using the different segmentation to visualize the corresponding data. Each bar would be divided into to parts (within and not_within) where each part would be then divided into two subparts showing the percentage of the interaction types (*Clicked A' or Clicked B).

Jaap · Accepted Answer

With the update description, I would make a combined barplot with two parts: a negative and a positve one. In order to achieve that, you have to get your data into the correct format:

# load needed libraries
library(dplyr)
library(tidyr)
library(ggplot2)

# summarise your data
new.df <- RawDataSet %>% 
  group_by(UserEmail,location,context) %>% 
  tally() %>%
  mutate(n2 = n * c(1,-1)[(location=="NOT_WITHIN")+1L]) %>%
  group_by(UserEmail,location) %>%
  mutate(p = c(1,-1)[(location=="NOT_WITHIN")+1L] * n/sum(n))

The new.df dataframe looks like:

> new.df
Source: local data frame [90 x 6]
Groups: UserEmail, location [54]

   UserEmail   location   context     n    n2          p
      (fctr)     (fctr)    (fctr) (int) (dbl)      (dbl)
1      andre NOT_WITHIN Clicked A     3    -3 -1.0000000
2       bibi NOT_WITHIN Clicked A     4    -4 -0.5000000
3       bibi NOT_WITHIN Clicked B     4    -4 -0.5000000
4       bibi     WITHIN Clicked A     9     9  0.6000000
5       bibi     WITHIN Clicked B     6     6  0.4000000
6     corinn NOT_WITHIN Clicked A    10   -10 -0.5882353
7     corinn NOT_WITHIN Clicked B     7    -7 -0.4117647
8     corinn     WITHIN Clicked A     9     9  0.7500000
9     corinn     WITHIN Clicked B     3     3  0.2500000
10  dpfeifer NOT_WITHIN Clicked A     7    -7 -1.0000000
..       ...        ...       ...   ...   ...        ...

Next you can create a plot with:

ggplot() +
  geom_bar(data = new.df[new.df$location == "NOT_WITHIN",],
           aes(x = UserEmail, y = n2, color = "darkgreen", fill = context),
           size = 1, stat = "identity", width = 0.7) +
  geom_bar(data = new.df[new.df$location == "WITHIN",],
           aes(x = UserEmail, y = n2, color = "darkred", fill = context),
           size = 1, stat = "identity", width = 0.7) +
  scale_y_continuous(breaks = seq(-20,20,5),
                     labels = c(20,15,10,5,0,5,10,15,20)) +
  scale_color_manual("Location of interaction",
                     values = c("darkgreen","darkred"),
                     labels = c("NOT_WITHIN","WITHIN")) +
  scale_fill_manual("Type of interaction",
                    values = c("lightyellow","lightblue"),
                    labels = c("Clicked A","Clicked B")) +
  guides(color = guide_legend(override.aes = list(color = c("darkred","darkgreen"),
                                                  fill = NA, size = 2), reverse = TRUE),
         fill = guide_legend(override.aes = list(fill = c("lightyellow","lightblue"),
                                                 color = "black", size = 0.5))) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5, size = 14),
        axis.title = element_blank(),
        legend.title = element_text(face = "italic", size = 14),
        legend.key.size = unit(1, "lines"),
        legend.text = element_text(size = 11))

which results in:

If you want to use percentage values, you can use the p-column to make a plot:

ggplot() +
  geom_bar(data = new.df[new.df$location == "NOT_WITHIN",],
           aes(x = UserEmail, y = p, color = "darkgreen", fill = context),
           size = 1, stat = "identity", width = 0.7) +
  geom_bar(data = new.df[new.df$location == "WITHIN",],
           aes(x = UserEmail, y = p, color = "darkred", fill = context),
           size = 1, stat = "identity", width = 0.7) +
  scale_y_continuous(breaks = c(-1,-0.75,-0.5,-0.25,0,0.25,0.5,0.75,1),
                     labels = scales::percent(c(1,0.75,0.5,0.25,0,0.25,0.5,0.75,1))) +
  scale_color_manual("Location of interaction",
                     values = c("darkgreen","darkred"),
                     labels = c("NOT_WITHIN","WITHIN")) +
  scale_fill_manual("Type of interaction",
                    values = c("lightyellow","lightblue"),
                    labels = c("Clicked A","Clicked B")) +
  coord_flip() +
  guides(color = guide_legend(override.aes = list(color = c("darkred","darkgreen"),
                                                  fill = NA, size = 2), reverse = TRUE),
         fill = guide_legend(override.aes = list(fill = c("lightyellow","lightblue"),
                                                 color = "black", size = 0.5))) +
  theme_minimal(base_size = 14) +
  theme(axis.title = element_blank(),
        legend.title = element_text(face = "italic", size = 14),
        legend.key.size = unit(1, "lines"),
        legend.text = element_text(size = 11))

which results in:

In response to the comment

If you want to place the text-labels inside the bars, you will have to calculate a position variable too:

new.df <- RawDataSet %>% 
  group_by(UserEmail,location,context) %>% 
  tally() %>%
  mutate(n2 = n * c(1,-1)[(location=="NOT_WITHIN")+1L]) %>%
  group_by(UserEmail,location) %>%
  mutate(p = c(1,-1)[(location=="NOT_WITHIN")+1L] * n/sum(n),
         pos = (context=="Clicked A")*p/2 + (context=="Clicked B")*(c(1,-1)[(location=="NOT_WITHIN")+1L] * (1 - abs(p)/2)))

Then add the following line to your ggplot code after the geom_bar's:

geom_text(data = new.df, aes(x = UserEmail, y = pos, label = n))

which results in:

Instead of label = n you can also use label = scales::percent(abs(p)) to display the percentages.

Combined frequency histogram using two attributes

Answers (1)

Related Questions