Vedda
Vedda

Reputation: 7435

dplyr to check each firm has June history

I have a 1m+ data set and need to go through and check that for each firm(cusip) and year(fyear) there is an observation for June, where datadate is YYYYMMDD. I've tried using substr() to pull out the month and did a logical test and if true then leave alone, but if not, that cusip gets removed. However, this is not working and is returning errors about not a logical argument and condition lengths. I've stepped through each of these outside of dplyr to make sure everything is working and I haven't run into any problems except for inside dplyr. Any help would be greatly appreciated.

Reproducible code :

tdata <- structure(list(cusip = c(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 5, 5, 5, 5, 5, 5, 2), fyear = c(1962L, 
1963L, 1964L, 1965L, 1966L, 1967L, 1968L, 1969L, 1970L, 1971L, 
1972L, 1973L, 1974L, 1975L, 1976L, 1977L, 1978L, 1979L, 1980L, 
1981L, 1982L, 1983L, 1984L, 1985L, 1962L, 1963L, 1964L, 1965L, 
1966L, 1967L, 1969L), datadate = c(19620631L, 19630631L, 19640631L, 
19651231L, 19661231L, 19670631L, 19680631L, 19691231L, 19700631L, 
19710631L, 19720631L, 19730631L, 19740631L, 19751231L, 19760631L, 
19770631L, 19780631L, 19791231L, 19800631L, 19810631L, 19820631L, 
19831231L, 19841231L, 19850631L, 19621231L, 19630631L, 19640631L, 
19650631L, 19660631L, 19670631L, 19690631L)), .Names = c("cusip", "fyear", 
"datadate"), row.names = c(NA, 31L), class = "data.frame")

tdata %>% 
  group_by(cusip) %>% 
  group_by(fyear) %>% 
  arrange(desc(datadate)) %>% 
  if(substr(datadate[1], 5,6) != 06) cusip <- NULL

Error :

Error in if (.) as.numeric(substr(datadate[1], 5, 6)) != 6 else cusip <- NULL : 
  argument is not interpretable as logical
In addition: Warning message:
In if (.) as.numeric(substr(datadate[1], 5, 6)) != 6 else cusip <- NULL :
  the condition has length > 1 and only the first element will be used

Upvotes: 1

Views: 92

Answers (1)

JasonAizkalns
JasonAizkalns

Reputation: 20473

Why not create a column for month first? Something like:

library(dplyr)
tdata$month <- substr(tdata$datadate, 5, 6)

tdata %>%
  group_by(cusip, fyear) %>%
  mutate(has_June = month == "06")

Notice that month is a string, therefore to check for equality you will need to use quotes.

All in one go:

tdata %>%
  group_by(cusip, fyear) %>%
  mutate(month = substr(datadate, 5, 6),
         has_June = month == "06")

Then you could find those without June with the addition of: %>% filter(month != "06")

Upvotes: 1

Related Questions