Reputation: 678
I am creating a HeatMap in R from a dataset of about 7000 entries using ggplot2 in R but I am not feeling comfortable with the results due to the legend of my plot stating "Vol" values I know single instances have, and I am wanting to plot an aggregate volume for the Day of Month / Day of Week combination.
So my newbie assumption is ggplot2 does not auto-aggregate the values, or I am missing something prior to plotting.
Basically I want something similar to the following in Excel to be done on all 217 cells at the point of plotting or before.
=SUMIFS(C2:C9999,D2:D9999,"1",F2:F9999,"Wed")
Which for the first day of the Month that is on a Wednesday for example Excel states it is "6098822" and "6756753" for a 1st day of the month on a Thursday; more than the maximum density of the HeatMap below.
Most likely I am just missing a basic aggregation function to produce the correct aggregation combinations.
Can someone point out what I should be doing and explain exactly what GGPLOT2 is doing below such as is it using the first combination it finds for the plot?
Note: DoY ->1,365 DofM->1-31 DofW->Sun-Sat
My data looks like the following:
> head(AC3,10)
DD Date Vol DofM DoY DofW
1 1 01/01/97 28857 1 1 Wed
2 2 01/02/97 37757 2 2 Thu
3 3 01/03/97 41394 3 3 Fri
4 4 01/04/97 39614 4 4 Sat
5 5 01/05/97 32735 5 5 Sun
6 6 01/06/97 33536 6 6 Mon
7 7 01/07/97 33547 7 7 Tue
8 8 01/08/97 34205 8 8 Wed
9 9 01/09/97 35804 9 9 Thu
10 10 01/10/97 42895 10 10 Fri
>p<-ggplot(AC3,aes(DofM,DofW))
>p + geom_tile(aes(fill=Vol)) + scale_fill_gradient(low="white", high="darkblue") + xlab("Day of Month") + ylab("Day of Week")
This produces the graph:
Upvotes: 0
Views: 344
Reputation: 161
You can test what ggplot
is doing by contriving the data to make it easy to work out how ggplot
is treating it.
For example, below is R-code to create a dataset that has the top 4 rows repeated on the last four rows. However, the Vol
values are negative on the repeated bottom 4 rows. Run this to see:
# take in data similar to what was show in question
AC3<-scan(what=list(id=0,DD=0,Date="",Vol=0,DofM=0,DoY=0,DofW=""))
1 1 01/01/97 28857 1 1 Wed
2 2 01/02/97 37757 2 2 Thu
3 3 01/03/97 41394 3 3 Fri
4 4 01/04/97 39614 4 4 Sat
5 5 01/01/97 -8857 1 1 Wed
6 6 01/02/97 -7757 2 2 Thu
7 7 01/03/97 -1394 3 3 Fri
8 8 01/04/97 -9614 4 4 Sat
#turn into data.frame for ggplot
AC3<-as.data.frame(AC3)
# do the plottin'
require(ggplot2)
ggp2 <- ggplot(AC3,aes(DofM,DofW))
ggp2 <- ggp2 + geom_tile(aes(fill=Vol))
ggp2 <- ggp2 + scale_fill_gradient(low="white",high="darkblue")
ggp2 <- ggp2 + xlab("Day of Month")
ggp2 <- ggp2 + ylab("Day of Week")
print(ggp2)
You can see ggplot just uses the last Vol
value of combination DofM,DofW
. Now to get to what you really want to do, sum up the repeated Vol
values for DofM,DofW
combinations. My weapon of choice is the R
library sqldf
(SQL data frame). It's a way of using SQL in R.
# need to install sqldf: install.packages("sqldf",dep=TRUE)
require(sqldf)
# sql query to add all Vol values for unique DofM, DofW combinations
AC4<-sqldf(paste("select DofM, DofW, sum(Vol) as newvol"
,"from AC3"
,"group by DofM, DofW"
))
# see if you agree with the output:
print(AC4)
Upvotes: 1