CRSouser
CRSouser

Reputation: 678

Summing Columns conditionally prior to use in HeatMap GGPLOT2 in R?

I am creating a HeatMap in R from a dataset of about 7000 entries using ggplot2 in R but I am not feeling comfortable with the results due to the legend of my plot stating "Vol" values I know single instances have, and I am wanting to plot an aggregate volume for the Day of Month / Day of Week combination.

So my newbie assumption is ggplot2 does not auto-aggregate the values, or I am missing something prior to plotting.

Basically I want something similar to the following in Excel to be done on all 217 cells at the point of plotting or before.

=SUMIFS(C2:C9999,D2:D9999,"1",F2:F9999,"Wed")

Which for the first day of the Month that is on a Wednesday for example Excel states it is "6098822" and "6756753" for a 1st day of the month on a Thursday; more than the maximum density of the HeatMap below.

Most likely I am just missing a basic aggregation function to produce the correct aggregation combinations.

Can someone point out what I should be doing and explain exactly what GGPLOT2 is doing below such as is it using the first combination it finds for the plot?

Note: DoY ->1,365 DofM->1-31 DofW->Sun-Sat

My data looks like the following:

> head(AC3,10)
   DD     Date   Vol DofM DoY DofW
1   1 01/01/97 28857    1   1  Wed
2   2 01/02/97 37757    2   2  Thu
3   3 01/03/97 41394    3   3  Fri
4   4 01/04/97 39614    4   4  Sat
5   5 01/05/97 32735    5   5  Sun
6   6 01/06/97 33536    6   6  Mon
7   7 01/07/97 33547    7   7  Tue
8   8 01/08/97 34205    8   8  Wed
9   9 01/09/97 35804    9   9  Thu
10 10 01/10/97 42895   10  10  Fri


>p<-ggplot(AC3,aes(DofM,DofW))
>p + geom_tile(aes(fill=Vol)) + scale_fill_gradient(low="white",   high="darkblue") + xlab("Day of Month") + ylab("Day of Week")

This produces the graph:

enter image description here

Upvotes: 0

Views: 344

Answers (1)

tystanza
tystanza

Reputation: 161

You can test what ggplot is doing by contriving the data to make it easy to work out how ggplot is treating it.

For example, below is R-code to create a dataset that has the top 4 rows repeated on the last four rows. However, the Vol values are negative on the repeated bottom 4 rows. Run this to see:

# take in data similar to what was show in question
AC3<-scan(what=list(id=0,DD=0,Date="",Vol=0,DofM=0,DoY=0,DofW=""))
1   1 01/01/97 28857    1   1  Wed
2   2 01/02/97 37757    2   2  Thu
3   3 01/03/97 41394    3   3  Fri
4   4 01/04/97 39614    4   4  Sat
5   5 01/01/97 -8857    1   1  Wed
6   6 01/02/97 -7757    2   2  Thu
7   7 01/03/97 -1394    3   3  Fri
8   8 01/04/97 -9614    4   4  Sat

#turn into data.frame for ggplot
AC3<-as.data.frame(AC3)

# do the plottin'
require(ggplot2)
ggp2 <- ggplot(AC3,aes(DofM,DofW))
ggp2 <- ggp2 + geom_tile(aes(fill=Vol)) 
ggp2 <- ggp2 + scale_fill_gradient(low="white",high="darkblue") 
ggp2 <- ggp2 + xlab("Day of Month") 
ggp2 <- ggp2 + ylab("Day of Week")
print(ggp2)

You can see ggplot just uses the last Vol value of combination DofM,DofW. Now to get to what you really want to do, sum up the repeated Vol values for DofM,DofW combinations. My weapon of choice is the R library sqldf (SQL data frame). It's a way of using SQL in R.

# need to install sqldf: install.packages("sqldf",dep=TRUE)
require(sqldf)
# sql query to add all Vol values for unique DofM, DofW combinations
AC4<-sqldf(paste("select DofM, DofW, sum(Vol) as newvol"
                ,"from AC3"
                ,"group by DofM, DofW"
                ))
# see if you agree with the output:
print(AC4)

Upvotes: 1

Related Questions