Kevin Sun
Kevin Sun

Reputation: 1211

How to merge two dataframes on a column that needs to be 3 months ahead without using for loop

I have experience using the inner_join function in R on dataframes in which the column values matches between two dataframes. However, I have one dataframe that has the average stock prices for each month for every stock from 2007-2014, and another dataframe that has each stock's financial ratios from 2007-2014 and shows the month of the fiscal year end for each company. The problem is, a company's financial ratios aren't reported until they release their 10K which is 3 months later. Thus, I want to match each company's financial ratios with their appropriate stock price which is 3 months afterwards.

RatioDF:

Symbol Month Year 10KRatio1 10KRatio2 ... 10KRatioN
 FLWS    6   2007   100        200    ...    1000
 ACAD    12  2007   500        600    ...    2000

StockPriceDF:

Company Year Month MeanPrice
 FLWS   2007  1      6.32
   .     .    .       .
   .     .    .       .
   .     .    .       .
 FLWS   2007  9     10.995
   .     .    .       .
   .     .    .       .
   .     .    .       .
 FLWS   2014  12    17.92
   .     .    .       .
 ACAD   2007  1      7.5
   .     .    .       .
   .     .    .       .
   .     .    .       .
 ACAD   2008  3      8.64
   .     .    .       .
   .     .    .       .

DesiredDF:

Symbol Month Year 10KRatio1 10KRatio2 ... 10KRatioN MeanPrice
 FLWS   9    2007   100       200           1000      10.995
 ACAD   3    2008   500       600           2000       8.64

I was thinking of using a for loop to check if the RatioDF months is from 10-12 and then just match it with the months 1-3 of next year for the appropriate Symbol/Company, but I think computation might take too long since there are a lot of stocks for these years and a lot of monthly prices too.

Upvotes: 1

Views: 47

Answers (1)

Jaap
Jaap

Reputation: 83255

A possible solution with lubridate and either data.table or dplyr.

1) with data.table:

# load packages
library(lubridate)
library(data.table)

# convert both dataframes to data.table's and add a 'date'-variable
setDT(d1)[, date := as.IDate(sprintf('%s-%02d-01',Year,Month))][]

# idem + substract 3 months with lubridate's '%m-%` function
setDT(d2)[, date := as.IDate(sprintf('%s-%02d-01',Year,Month)) %m-% months(3)][]

# join d1 with d2 and update d1 by reference
d1[d2, on = .(Symbol = Company, date), MeanPrice := MeanPrice][]

which gives:

   Symbol Month Year 10KRatio1 10KRatio2       date MeanPrice
1:   FLWS     6 2007       100       200 2007-06-01    10.995
2:   ACAD    12 2007       500       600 2007-12-01     8.640

An alternative join-method could be:

d1[d2[, .(Company, date, MeanPrice)], on = .(Symbol = Company, date), nomatch = 0L][]

2) with dplyr:

# load packages
library(lubridate)
library(dplyr)

# add a 'date'-variable to 'd1'
# add a 'date'-variable to 'd2' and substract 3 months
# from that  with lubridate's '%m-%` function
# select only 'Company', 'date' and 'MeanPrice' from 'd2'
# join 'd1' with 'd2'

d1 %>% 
  mutate(date = as.Date(sprintf('%s-%02d-01',Year,Month))) %>% 
  left_join(., d2 %>% 
              mutate(date = as.Date(sprintf('%s-%02d-01',Year,Month)) %m-% months(3)) %>% 
              select(Company, date, MeanPrice),
            by = c('Symbol' = 'Company', 'date'))

which gives the same result:

  Symbol Month Year 10KRatio1 10KRatio2       date MeanPrice
1   FLWS     6 2007       100       200 2007-06-01    10.995
2   ACAD    12 2007       500       600 2007-12-01     8.640

Used data:

d1 <- structure(list(Symbol = c("FLWS", "ACAD"), 
                     Month = c(6L, 12L), 
                     Year = c(2007L, 2007L), 
                     `10KRatio1` = c(100L, 500L), 
                     `10KRatio2` = c(200L, 600L)), 
                .Names = c("Symbol", "Month", "Year", "10KRatio1", "10KRatio2"), class = "data.frame", row.names = c(NA, -2L))

d2 <- structure(list(Company = c("FLWS", "FLWS", "FLWS", "ACAD", "ACAD"),
                     Year = c(2007L, 2007L, 2014L, 2007L, 2008L),
                     Month = c(1L, 9L, 12L, 1L, 3L),
                     MeanPrice = c(6.32, 10.995, 17.92, 7.5, 8.64)),
                .Names = c("Company", "Year", "Month", "MeanPrice"), class = "data.frame", row.names = c(NA, -5L))

Upvotes: 2

Related Questions