Reputation: 7435
I have a bunch of unique cusip codes(unique ID) and need to check to make sure there are 24-60 previous month observations, but am not sure how to check this with dplyr
tdata <- structure(list(cusip = c(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2), fyear = c("1971", "1971", "1971", "1971",
"1971", "1971", "1971", "1971", "1971", "1971", "1971", "1971",
"1972", "1972", "1972", "1972", "1972", "1972", "1972", "1972",
"1972", "1972", "1972", "1972", "1972", "1973", "1973", "1973",
"1973", "1973", "1973", "1973", "1973", "1973", "1973", "1973",
"1973", "1974", "1974", "1974", "1974", "1974", "1974", "1974",
"1974", "1974", "1974", "1974", "1974", "1975", "1975", "1975",
"1975", "1975", "1975", "1975", "1975", "1975", "1975", "1975"
), datadate = c(19711231L, 19710129L, 19710226L, 19710331L, 19710430L,
19710528L, 19710630L, 19710730L, 19710831L, 19710930L, 19711029L,
19711130L, 19721231L, 19720131L, 19720229L, 19720330L, 19720428L,
19720531L, 19720630L, 19720731L, 19720831L, 19720929L, 19721031L,
19721130L, 19721229L, 19731231L, 19730131L, 19730228L, 19730330L,
19730430L, 19730531L, 19730629L, 19730731L, 19730831L, 19730928L,
19731031L, 19731130L, 19741231L, 19740131L, 19740228L, 19740329L,
19740430L, 19740531L, 19740628L, 19740731L, 19740830L, 19740930L,
19741031L, 19741129L, 19751231L, 19750131L, 19750228L, 19750331L,
19750430L, 19750530L, 19750630L, 19750731L, 19750829L, 19750930L,
19751031L)), .Names = c("cusip", "fyear", "datadate"), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -60L), vars = list(
cusip, fyear), drop = TRUE, indices = list(0:11, 12:24, 25:36,
37:48, 49:59), group_sizes = c(12L, 13L, 12L, 12L, 11L), biggest_group_size = 13L, labels = structure(list(
cusip = c(2, 2, 2, 2, 2), fyear = c("1971", "1972", "1973",
"1974", "1975")), class = "data.frame", row.names = c(NA,
-5L), .Names = c("cusip", "fyear"), vars = list(cusip, fyear)))
I was thinking about checking the total months for each year, but don't know know to extract the previous to months to check if 24/60 >= 0.4. How would I edited this code to check the previous 60 months and ensure there are at least 24 months including....
tdata %>%
group_by(cusip, fyear) %>%
mutate(month = substr(datadate, 5, 6) %>%
mutate(pre_countmonths = length(unique(month))
Here is the logic I am following with for loops. One of the challenges I'm having with R is branching outside of for loops. Any possible way to edit this where dplyr
can be utilized instead of for
loops? This will take entirely too long to run with my current data.
for(i in min(tdata$cusip):max(tdata$cusip)){
for (j in min(tdata$fyear):max(tdata$fyear) {
monthcheck <- filter(tdata, cusip == i & (fyear == j-1 | fyear == j-2 | fyear == j-3 | fyear == j-4))
if(length(monthcheck$month) / 40 >= 0.4) if(any(tdata$fyear == j)) tdata$check <- 1
}}
Small Subset : https://www.dropbox.com/s/mf0o0tbgbame6k8/testdata.csv?dl=0
Upvotes: 0
Views: 119
Reputation: 23574
This is what I got within my time limit. I hope that this gives you some ideas, and that other users provide better solutions.
mydf <- as_data_frame(list(cusip = c(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2), fyear = c("1971", "1971", "1971", "1971",
"1971", "1971", "1971", "1971", "1971", "1971", "1971", "1971",
"1972", "1972", "1972", "1972", "1972", "1972", "1972", "1972",
"1972", "1972", "1972", "1972", "1972", "1973", "1973", "1973",
"1973", "1973", "1973", "1973", "1973", "1973", "1973", "1973",
"1973", "1974", "1974", "1974", "1974", "1974", "1974", "1974",
"1974", "1974", "1974", "1974", "1974", "1975", "1975", "1975",
"1975", "1975", "1975", "1975", "1975", "1975", "1975", "1975"
), datadate = c(19711231L, 19710129L, 19710226L, 19710331L, 19710430L,
19710528L, 19710630L, 19710730L, 19710831L, 19710930L, 19711029L,
19711130L, 19721231L, 19720131L, 19720229L, 19720330L, 19720428L,
19720531L, 19720630L, 19720731L, 19720831L, 19720929L, 19721031L,
19721130L, 19721229L, 19731231L, 19730131L, 19730228L, 19730330L,
19730430L, 19730531L, 19730629L, 19730731L, 19730831L, 19730928L,
19731031L, 19731130L, 19741231L, 19740131L, 19740228L, 19740329L,
19740430L, 19740531L, 19740628L, 19740731L, 19740830L, 19740930L,
19741031L, 19741129L, 19751231L, 19750131L, 19750228L, 19750331L,
19750430L, 19750530L, 19750630L, 19750731L, 19750829L, 19750930L,
19751031L)))
# Make it normal data.frame
mydf <- data.frame(mydf)
# Create another data frame with a new cusip
mydf2 <- mutate(mydf, cusip = 3)
### Create a new data frame which is missing one data point
foo <- bind_rows(mydf, mydf2[-4, ])
In this pseudo data, cusip 3 is missing one month data. That means, you do not have consecutive 24-60-month data for cusip
3. First, I created a column with month and a column with date object. Then, I ordered your data by cusp
and datadate. I wanted to select data points which stay between this 24-60 month period. This is the first filter
part. I grouped the data by cusp
. Using month, I wanted to check if I have consecutive data points or not. You would expect lead(month)-month
= 1, 11, or 0. If you have two data points from a same month, you would expect 0. That is happening in your data. The final filter
is something you can revise. Here, I wanted to remove any cusip
which got FALSE in check. In this draft, this filter seems to be doing the right thing; you do not see any data of cusip 3 in the end. I hope this helps you.
mutate(foo, month = as.numeric(substr(datadate, 5, 6))) %>%
mutate(datadate = as.POSIXct(gsub("^(\\d{4})(\\d{2}).*$", "\\1-\\2-01", datadate),
format("%Y-%m-%d"), tz = "GMT")) %>%
arrange(cusip, datadate) %>%
filter(between(datadate,
datadate[tail(which(month == 6, arr.ind = TRUE), n = 1)] - (60*60*24*30*60),
datadate[tail(which(month == 6, arr.ind = TRUE), n = 1)] -(60*60*24*30*24))) %>%
group_by(cusip) %>%
mutate(check = abs(lead(month)-month) == 11|abs(lead(month)-month) == 1|abs(lead(month)-month) == 0) %>%
filter(all(check == TRUE | check %in% NA))
Upvotes: 1