Jason Donnald
Jason Donnald

Reputation: 2316

Finding correlation between two data sets in R

UPDATED DATA SET 2 and 1 STRUCTURE: Sorry for this sudden update. I have two data sets. The structure of my first data set is (when using print(matr1) in R):

        month_year  income
 [1,]  "Jan 2000"  "30000"
 [2,]  "Feb 2000"  "12364"
 [3,]  "Mar 2000"  "37485"
 [4,]  "Apr 2000"  "2000"
 [5,]  "Jun 2000"  "7573"
          .     .      .
          .     .      .

Thus the first data set has one income value for each month of each year.

The structure of my second data set is (when using print(matr2) in R):

     month_year     value
 [1,] "Jan 2000" "84737476"
 [2,] "Jan 2000" "39450334"
 [3,] "Jan 2000" "48384943"
 [4,] "Feb 2000" "12345678"
 [5,] "Feb 2000" "49595340"

          .     .      .
          .     .      .

So in this second data set I have n(say 100 but not constant all the time) number of values for each month of each year.

Both the data sets have values month wise for many subsequent years(like for all the months for year 2000, 2001 and so on). Now I want to find the correlation between these two data sets but month-wise and not as whole. When I use the R command cor(as.numeric(matr1[,"income"]),as.numeric(matr2[,"value"])) then I get the overall correlation but I want the correlation per month rather than as a whole. I want the correlation something like this:

                  Jan | Feb | Mar | Apr | May | .....
Correlation        x  |  y  |  z  |  p  |  q  | .....

The issue that I have is:

  1. How do I get the correlation value per month instead of the overall correlation?

NOTE: I am not sure whether I should have posted this question here or on Cross Validated. I had posted a question for this data set only regarding an error in getting correlation and it was migrated from there to here. So please forgive if I am posting this on wrong place.

UPDATE1: After some suggestion I have modified this post to point in the correct dimension. First of all, the data sets as of now are in matrix format and hence the quotes. I can convert it to data.frame as suggested by some comments but right now I have been calculating the correlation by converting the columns using as.numeric.

Upvotes: 4

Views: 11927

Answers (2)

akrun
akrun

Reputation: 886938

May be you can try:

dat1 <- structure(list(year = c(2000L, 2000L, 2000L, 2000L, 2000L, 2001L, 
2001L, 2001L, 2001L, 2001L), month = c(1L, 2L, 3L, 4L, 5L, 1L, 
2L, 3L, 4L, 5L), income = c(30000L, 12364L, 37485L, 2000L, 7573L, 
25000L, 14364L, 38485L, 4000L, 7873L)), .Names = c("year", "month", 
"income"), class = "data.frame", row.names = c(NA, -10L))

dat2 <- structure(list(month_year = c("Jan 2000", "Feb 2000", "Mar 2000", 
"Apr 2000", "May 2000", "Jan 2001", "Feb 2001", "Mar 2001", "Apr 2001", 
"May 2001"), value = c(84737476L, 39450334L, 48384943L, 12345678L, 
49595340L, 84337476L, 34450334L, 48984943L, 124545678L, 49525340L
)), .Names = c("month_year", "value"), class = "data.frame", row.names = c(NA, 
-10L))



 dat1$month_year <- paste(month.abb[dat1$month], dat1$year)
 dat1$month <- gsub(" \\d+","", dat1$month_year)
 dat2$month <- gsub(" \\d+","", dat2$month_year)
 dat1$indx <- with(dat1, ave(month, month, FUN=seq_along))
 dat2$indx <- with(dat2, ave(month, month, FUN=seq_along))
 dat1 <- dat1[,c(2,3,5)]
 dat2 <- dat2[,c(3,2,4)]
 colnames(dat2)[2] <- "income"

 library(reshape2)

 dat2C <- dcast(dat2, indx~month, value.var="income")
 dat1C <- dcast(dat1, indx~month, value.var="income")
 m1 <- as.matrix(dat1C[,-1])
 m2 <- as.matrix(dat2C[,-1])
 cor(m1,m2)
  diag(cor(m1,m2))
 # Apr Feb Jan Mar May 
  #1   -1   1   1  -1 

Also, if you can merge the two datasets together, this could be done using data.table. Using the dput() data above

 library(data.table)
 dat1$month_year <- paste(month.abb[dat1$month], dat1$year)
 dat1 <- dat1[,c(4,3)]
 setDT(dat1)
 setDT(dat2)
 setkey(dat2, month_year)

 dat2[dat1, income := i.income]
 dat2[,month:= gsub(" \\d+", "", month_year)][,cor(value, income), by=month] 
 #    month V1
 #1:   Apr  1
 #2:   Feb -1
 #3:   Jan  1
 #4:   Mar  1
 #5:   May -1

Update

dat1 <- structure(list(month_year = structure(c(5L, 3L, 8L, 1L, 7L, 6L, 
4L, 9L, 2L), .Label = c("Apr 2000", "Apr 2001", "Feb 2000", "Feb 2001", 
"Jan 2000", "Jan 2001", "Jun 2000", "Mar 2000", "Mar 2001"), class = "factor"), 
income = c(30000, 12364, 37485, 2000, 7573, 42000, 15764, 
38465, 5000)), .Names = c("month_year", "income"), row.names = c(NA, 
-9L), class = "data.frame")


 dat2 <-  structure(list(month_year = structure(c(5L, 5L, 5L, 3L, 3L, 7L, 
 7L, 7L, 1L, 1L, 6L, 6L, 4L, 4L, 8L, 8L, 2L, 2L, 2L, 2L), .Label = c("Apr 2000", 
 "Apr 2001", "Feb 2000", "Feb 2001", "Jan 2000", "Jan 2001", "Mar 2000", 
 "Mar 2001"), class = "factor"), value = c(84737476, 39450334, 
 48384973, 12345678, 49595340, 4534353, 43353325, 84333535, 35343232, 
 4334353, 3434353, 5355322, 5223345, 4523535, 345353, 32235, 423553, 
 233553, 423535, 884455)), .Names = c("month_year", "value"), row.names = c(NA, 
 -20L), class = "data.frame")


 datN <- merge(dat1, dat2, all=T)
 library(data.table)
 DT <- data.table(datN)
 DT[, month:= gsub(" \\d+", "", month_year)][,cor(value, income),by=month]
 #   month         V1
 #1:   Apr -0.7136049
 #2:   Feb -0.7037676
 #3:   Jan -0.8637808
 #4:   Jun         NA
 #5:   Mar -0.6484684

Upvotes: 2

Spacedman
Spacedman

Reputation: 94172

Get your data into a data frame with month, value, and income columns. EG:

d = data.frame(month=rep(1:12,5),value=runif(60,10000000,60000000), income=runif(60,5000,40000))

> head(d)
  month    value   income
1     1 58348424 34478.63
2     2 59512513 16179.46
3     3 21844994 20961.56
4     4 25843593 38502.16
5     5 24805863 12397.32
6     6 24200966 24110.27

Then its as simple as using dplyr to group by month and summarize:

> require(dplyr)
> d %.% group_by(month) %.% summarize(cor = cor(value, income))
Source: local data frame [12 x 2]

   month         cor
1      1  0.17774478
2      2 -0.61693145
3      3 -0.05692027
4      4 -0.44966542
5      5 -0.30049386
6      6  0.09447414
7      7  0.67567298
8      8  0.14363810
9      9 -0.71899361
10    10  0.20807679
11    11 -0.42560100
12    12  0.23584150

Getting the month number from a date string is covered in many other places... but here I'd use the lubridate package. For the month/year strings in your second data set, for example:

require(lubridate)
month(dmy(paste("01",dat2$month_year)))

returns the month number. Note the trick of sticking a "01" on the start to make it a valid date.

Upvotes: 0

Related Questions