Reputation: 571
I'm using ggplot2 to create histograms for two different parameters. My current approach is attached at the end of my question (including a dataset, which can be used and loaded right from pasetbin.com), which creates
This looks like the follwoing:
# Load my example dataset from pastebin
RawDataSet <- read.csv("http://pastebin.com/raw/uKybDy03", sep=";")
# Load packages
library(plyr)
library(dplyr)
library(reshape2)
library(ggplot2)
###### Create Frequency Table for Location-Information
LocationFrequency <- ddply(RawDataSet, .(UserEmail), summarize,
All = length(UserEmail),
Within_area = sum(location=="WITHIN"),
Not_within_area = sum(location=="NOT_WITHIN"))
# Create a column for unique identifiers
LocationFrequency <- mutate(LocationFrequency, id = rownames(LocationFrequency))
# Reorder columns
LocationFrequency <- LocationFrequency[,c(5,1:4)]
# Format id-column as numbers (not as string)
LocationFrequency[,c(1)] <- sapply(LocationFrequency[, c(1)], as.numeric)
# Melt data
LocationFrequency.m = melt(LocationFrequency, id.var=c("UserEmail","All","id"))
# Plot data
p <- ggplot(LocationFrequency.m, aes(x=id, y=value, fill=variable)) +
geom_bar(stat="identity") +
theme_grey(base_size = 16)+
labs(title="Histogram showing the distribution of all spatial information per user.") +
labs(x="User", y="Number of notifications interaction within/not within the area") +
# using IDs instead of UserEmail
scale_x_continuous(breaks=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30), labels=c("1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","23","24","25","26","27","28","29","30"))
# Change legend Title
p + labs(fill = "Type of location")
##### Create Frequency Table for Interaction-Information
InterationFrequency <- ddply(RawDataSet, .(UserEmail), summarize,
All = length(UserEmail),
Clicked_A = sum(context=="Clicked A"),
Clicked_B = sum(context=="Clicked B"))
# Create a column for unique identifiers
InterationFrequency <- mutate(InterationFrequency, id = rownames(InterationFrequency))
# Reorder columns
InterationFrequency <- InterationFrequency[,c(5,1:4)]
# Format id-column as numbers (not as string)
InterationFrequency[,c(1)] <- sapply(InterationFrequency[, c(1)], as.numeric)
# Melt data
InterationFrequency.m = melt(InterationFrequency, id.var=c("UserEmail","All","id"))
# Plot data
p <- ggplot(InterationFrequency.m, aes(x=id, y=value, fill=variable)) +
geom_bar(stat="identity") +
theme_grey(base_size = 16)+
labs(title="Histogram showing the distribution of all interaction types per user.") +
labs(x="User", y="Number of interaction") +
# using IDs instead of UserEmail
scale_x_continuous(breaks=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30), labels=c("1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","23","24","25","26","27","28","29","30"))
# Change legend Title
p + labs(fill = "Type of interaction")
But what I'm trying to realize: How can I combine both histograms in only one plot? Would it be somehow possible to place the corressponding percentage for each part? Somethink like the following sketch, which represents the total number of observations per user (the complete height of the bar) and using the different segmentation to visualize the corresponding data. Each bar would be divided into to parts (within and not_within) where each part would be then divided into two subparts showing the percentage of the interaction types (*Clicked A' or Clicked B).
Upvotes: 2
Views: 1473
Reputation: 83215
With the update description, I would make a combined barplot with two parts: a negative and a positve one. In order to achieve that, you have to get your data into the correct format:
# load needed libraries
library(dplyr)
library(tidyr)
library(ggplot2)
# summarise your data
new.df <- RawDataSet %>%
group_by(UserEmail,location,context) %>%
tally() %>%
mutate(n2 = n * c(1,-1)[(location=="NOT_WITHIN")+1L]) %>%
group_by(UserEmail,location) %>%
mutate(p = c(1,-1)[(location=="NOT_WITHIN")+1L] * n/sum(n))
The new.df
dataframe looks like:
> new.df
Source: local data frame [90 x 6]
Groups: UserEmail, location [54]
UserEmail location context n n2 p
(fctr) (fctr) (fctr) (int) (dbl) (dbl)
1 andre NOT_WITHIN Clicked A 3 -3 -1.0000000
2 bibi NOT_WITHIN Clicked A 4 -4 -0.5000000
3 bibi NOT_WITHIN Clicked B 4 -4 -0.5000000
4 bibi WITHIN Clicked A 9 9 0.6000000
5 bibi WITHIN Clicked B 6 6 0.4000000
6 corinn NOT_WITHIN Clicked A 10 -10 -0.5882353
7 corinn NOT_WITHIN Clicked B 7 -7 -0.4117647
8 corinn WITHIN Clicked A 9 9 0.7500000
9 corinn WITHIN Clicked B 3 3 0.2500000
10 dpfeifer NOT_WITHIN Clicked A 7 -7 -1.0000000
.. ... ... ... ... ... ...
Next you can create a plot with:
ggplot() +
geom_bar(data = new.df[new.df$location == "NOT_WITHIN",],
aes(x = UserEmail, y = n2, color = "darkgreen", fill = context),
size = 1, stat = "identity", width = 0.7) +
geom_bar(data = new.df[new.df$location == "WITHIN",],
aes(x = UserEmail, y = n2, color = "darkred", fill = context),
size = 1, stat = "identity", width = 0.7) +
scale_y_continuous(breaks = seq(-20,20,5),
labels = c(20,15,10,5,0,5,10,15,20)) +
scale_color_manual("Location of interaction",
values = c("darkgreen","darkred"),
labels = c("NOT_WITHIN","WITHIN")) +
scale_fill_manual("Type of interaction",
values = c("lightyellow","lightblue"),
labels = c("Clicked A","Clicked B")) +
guides(color = guide_legend(override.aes = list(color = c("darkred","darkgreen"),
fill = NA, size = 2), reverse = TRUE),
fill = guide_legend(override.aes = list(fill = c("lightyellow","lightblue"),
color = "black", size = 0.5))) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5, size = 14),
axis.title = element_blank(),
legend.title = element_text(face = "italic", size = 14),
legend.key.size = unit(1, "lines"),
legend.text = element_text(size = 11))
which results in:
If you want to use percentage values, you can use the p
-column to make a plot:
ggplot() +
geom_bar(data = new.df[new.df$location == "NOT_WITHIN",],
aes(x = UserEmail, y = p, color = "darkgreen", fill = context),
size = 1, stat = "identity", width = 0.7) +
geom_bar(data = new.df[new.df$location == "WITHIN",],
aes(x = UserEmail, y = p, color = "darkred", fill = context),
size = 1, stat = "identity", width = 0.7) +
scale_y_continuous(breaks = c(-1,-0.75,-0.5,-0.25,0,0.25,0.5,0.75,1),
labels = scales::percent(c(1,0.75,0.5,0.25,0,0.25,0.5,0.75,1))) +
scale_color_manual("Location of interaction",
values = c("darkgreen","darkred"),
labels = c("NOT_WITHIN","WITHIN")) +
scale_fill_manual("Type of interaction",
values = c("lightyellow","lightblue"),
labels = c("Clicked A","Clicked B")) +
coord_flip() +
guides(color = guide_legend(override.aes = list(color = c("darkred","darkgreen"),
fill = NA, size = 2), reverse = TRUE),
fill = guide_legend(override.aes = list(fill = c("lightyellow","lightblue"),
color = "black", size = 0.5))) +
theme_minimal(base_size = 14) +
theme(axis.title = element_blank(),
legend.title = element_text(face = "italic", size = 14),
legend.key.size = unit(1, "lines"),
legend.text = element_text(size = 11))
which results in:
In response to the comment
If you want to place the text-labels inside the bars, you will have to calculate a position variable too:
new.df <- RawDataSet %>%
group_by(UserEmail,location,context) %>%
tally() %>%
mutate(n2 = n * c(1,-1)[(location=="NOT_WITHIN")+1L]) %>%
group_by(UserEmail,location) %>%
mutate(p = c(1,-1)[(location=="NOT_WITHIN")+1L] * n/sum(n),
pos = (context=="Clicked A")*p/2 + (context=="Clicked B")*(c(1,-1)[(location=="NOT_WITHIN")+1L] * (1 - abs(p)/2)))
Then add the following line to your ggplot
code after the geom_bar
's:
geom_text(data = new.df, aes(x = UserEmail, y = pos, label = n))
which results in:
Instead of label = n
you can also use label = scales::percent(abs(p))
to display the percentages.
Upvotes: 3