Reputation: 2316
UPDATED DATA SET 2 and 1 STRUCTURE: Sorry for this sudden update. I have two data sets. The structure of my first data set is (when using print(matr1)
in R
):
month_year income
[1,] "Jan 2000" "30000"
[2,] "Feb 2000" "12364"
[3,] "Mar 2000" "37485"
[4,] "Apr 2000" "2000"
[5,] "Jun 2000" "7573"
. . .
. . .
Thus the first data set has one income value for each month of each year.
The structure of my second data set is (when using print(matr2)
in R
):
month_year value
[1,] "Jan 2000" "84737476"
[2,] "Jan 2000" "39450334"
[3,] "Jan 2000" "48384943"
[4,] "Feb 2000" "12345678"
[5,] "Feb 2000" "49595340"
. . .
. . .
So in this second data set I have n
(say 100 but not constant all the time) number of values for each month of each year.
Both the data sets have values month wise for many subsequent years(like for all the months for year 2000, 2001 and so on). Now I want to find the correlation between these two data sets but month-wise and not as whole. When I use the R command cor(as.numeric(matr1[,"income"]),as.numeric(matr2[,"value"]))
then I get the overall correlation but I want the correlation per month rather than as a whole. I want the correlation something like this:
Jan | Feb | Mar | Apr | May | .....
Correlation x | y | z | p | q | .....
The issue that I have is:
NOTE: I am not sure whether I should have posted this question here or on Cross Validated
. I had posted a question for this data set only regarding an error in getting correlation and it was migrated from there to here. So please forgive if I am posting this on wrong place.
UPDATE1: After some suggestion I have modified this post to point in the correct dimension. First of all, the data sets as of now are in matrix format and hence the quotes. I can convert it to data.frame
as suggested by some comments but right now I have been calculating the correlation by converting the columns using as.numeric
.
Upvotes: 4
Views: 11927
Reputation: 886938
May be you can try:
dat1 <- structure(list(year = c(2000L, 2000L, 2000L, 2000L, 2000L, 2001L,
2001L, 2001L, 2001L, 2001L), month = c(1L, 2L, 3L, 4L, 5L, 1L,
2L, 3L, 4L, 5L), income = c(30000L, 12364L, 37485L, 2000L, 7573L,
25000L, 14364L, 38485L, 4000L, 7873L)), .Names = c("year", "month",
"income"), class = "data.frame", row.names = c(NA, -10L))
dat2 <- structure(list(month_year = c("Jan 2000", "Feb 2000", "Mar 2000",
"Apr 2000", "May 2000", "Jan 2001", "Feb 2001", "Mar 2001", "Apr 2001",
"May 2001"), value = c(84737476L, 39450334L, 48384943L, 12345678L,
49595340L, 84337476L, 34450334L, 48984943L, 124545678L, 49525340L
)), .Names = c("month_year", "value"), class = "data.frame", row.names = c(NA,
-10L))
dat1$month_year <- paste(month.abb[dat1$month], dat1$year)
dat1$month <- gsub(" \\d+","", dat1$month_year)
dat2$month <- gsub(" \\d+","", dat2$month_year)
dat1$indx <- with(dat1, ave(month, month, FUN=seq_along))
dat2$indx <- with(dat2, ave(month, month, FUN=seq_along))
dat1 <- dat1[,c(2,3,5)]
dat2 <- dat2[,c(3,2,4)]
colnames(dat2)[2] <- "income"
library(reshape2)
dat2C <- dcast(dat2, indx~month, value.var="income")
dat1C <- dcast(dat1, indx~month, value.var="income")
m1 <- as.matrix(dat1C[,-1])
m2 <- as.matrix(dat2C[,-1])
cor(m1,m2)
diag(cor(m1,m2))
# Apr Feb Jan Mar May
#1 -1 1 1 -1
Also, if you can merge the two datasets together, this could be done using data.table
. Using the dput()
data above
library(data.table)
dat1$month_year <- paste(month.abb[dat1$month], dat1$year)
dat1 <- dat1[,c(4,3)]
setDT(dat1)
setDT(dat2)
setkey(dat2, month_year)
dat2[dat1, income := i.income]
dat2[,month:= gsub(" \\d+", "", month_year)][,cor(value, income), by=month]
# month V1
#1: Apr 1
#2: Feb -1
#3: Jan 1
#4: Mar 1
#5: May -1
dat1 <- structure(list(month_year = structure(c(5L, 3L, 8L, 1L, 7L, 6L,
4L, 9L, 2L), .Label = c("Apr 2000", "Apr 2001", "Feb 2000", "Feb 2001",
"Jan 2000", "Jan 2001", "Jun 2000", "Mar 2000", "Mar 2001"), class = "factor"),
income = c(30000, 12364, 37485, 2000, 7573, 42000, 15764,
38465, 5000)), .Names = c("month_year", "income"), row.names = c(NA,
-9L), class = "data.frame")
dat2 <- structure(list(month_year = structure(c(5L, 5L, 5L, 3L, 3L, 7L,
7L, 7L, 1L, 1L, 6L, 6L, 4L, 4L, 8L, 8L, 2L, 2L, 2L, 2L), .Label = c("Apr 2000",
"Apr 2001", "Feb 2000", "Feb 2001", "Jan 2000", "Jan 2001", "Mar 2000",
"Mar 2001"), class = "factor"), value = c(84737476, 39450334,
48384973, 12345678, 49595340, 4534353, 43353325, 84333535, 35343232,
4334353, 3434353, 5355322, 5223345, 4523535, 345353, 32235, 423553,
233553, 423535, 884455)), .Names = c("month_year", "value"), row.names = c(NA,
-20L), class = "data.frame")
datN <- merge(dat1, dat2, all=T)
library(data.table)
DT <- data.table(datN)
DT[, month:= gsub(" \\d+", "", month_year)][,cor(value, income),by=month]
# month V1
#1: Apr -0.7136049
#2: Feb -0.7037676
#3: Jan -0.8637808
#4: Jun NA
#5: Mar -0.6484684
Upvotes: 2
Reputation: 94172
Get your data into a data frame with month, value, and income columns. EG:
d = data.frame(month=rep(1:12,5),value=runif(60,10000000,60000000), income=runif(60,5000,40000))
> head(d)
month value income
1 1 58348424 34478.63
2 2 59512513 16179.46
3 3 21844994 20961.56
4 4 25843593 38502.16
5 5 24805863 12397.32
6 6 24200966 24110.27
Then its as simple as using dplyr
to group by month and summarize:
> require(dplyr)
> d %.% group_by(month) %.% summarize(cor = cor(value, income))
Source: local data frame [12 x 2]
month cor
1 1 0.17774478
2 2 -0.61693145
3 3 -0.05692027
4 4 -0.44966542
5 5 -0.30049386
6 6 0.09447414
7 7 0.67567298
8 8 0.14363810
9 9 -0.71899361
10 10 0.20807679
11 11 -0.42560100
12 12 0.23584150
Getting the month number from a date string is covered in many other places... but here I'd use the lubridate
package. For the month/year strings in your second data set, for example:
require(lubridate)
month(dmy(paste("01",dat2$month_year)))
returns the month number. Note the trick of sticking a "01" on the start to make it a valid date.
Upvotes: 0