Reputation: 68
I have some financial data (1.5 years SP500 stocks) that I have manipulated into a wide format using the data.table package. After following the whole data.table course on Datacamp, I'm starting to get a hang of the basics, but after searching for hours I'm at a loss on how to do this.
The data contains columns with financial data for each stock. I need to delete columns that contain two consecutive NAs.
My guess is I have to use rle()
, lapply()
, to find consecutive values and DT[,x:= NULL]
) to delete the columns.
I read that rle()
doesn't work on NAs, so I changed them to Inf instead.
I just don't know how to combine the functions so that I can efficiently remove a few columns among the 460 that I have.
An answer using data.table
would be great, but anything that works well is very much appreciated.
Alternatively I would love to know how to remove columns containing at least 1 NA
> test[1:5,1:5,with=FALSE]
date 10104 10107 10138 10145
1: 2012-07-02 0.003199 Inf 0.001112 -0.012178
2: 2012-07-03 0.005873 0.006545 0.001428 Inf
3: 2012-07-05 Inf -0.001951 -0.011090 Inf
4: 2012-07-06 Inf -0.016775 -0.009612 Inf
5: 2012-07-09 -0.002742 -0.006129 -0.001294 0.005830
> dim(test)
[1] 377 461
date 10107 10138
1: 2012-07-02 Inf 0.001112
2: 2012-07-03 0.006545 0.001428
3: 2012-07-05 -0.001951 -0.011090
4: 2012-07-06 -0.016775 -0.009612
5: 2012-07-09 -0.006129 -0.001294
PS. This is my first question, I have tried to adhere to the rules, if I need to change anything please let me know.
Upvotes: 4
Views: 695
Reputation: 49448
Here's an rle
version:
dt[, sapply(dt, function(x)
setDT(rle(is.na(x)))[, sum(lengths > 1 & values) == 0]), with = F]
Or replace the is.na
with is.infinite
if you like.
Upvotes: 4
Reputation: 1462
To detect and delete columns containing atleast one NA, you can try the following
data = data.frame(A=c(1,2,3,4,5), B=c(2,3,4,NA,6), C=c(3,4,5,6,7), D=c(4,5,NA,NA,8))
colsToDelete = lapply(data, FUN = function(x){ sum(is.na(x)) >= 1 })
data.formatted = data[,c(!unlist(colsToDelete))]
Upvotes: 2
Reputation: 2400
This is what I came up with. It calls rle
on a vector y
that is 1:length(column)
unless a corresponding element of the column is Inf
, in which case the corresponding value in y is zero. Then it checks if any of the runs are greater than 1.
keep <- c(date = T, apply(dat[, -1], 2,
function(x) {
y <- 1:length(x)
y[!is.finite(x)] <- 0
return(!any(rle(y)$lengths > 1))
}))
dat2 <- dat[, keep]
dat2
# date X10107 X10138
# 1 2012-07-02 Inf 0.001112
# 2 2012-07-03 0.006545 0.001428
# 3 2012-07-05 -0.001951 -0.011090
# 4 2012-07-06 -0.016775 -0.009612
# 5 2012-07-09 -0.006129 -0.001294
Note that the column names are prepended with an "X" by read.table
.
Now, the dput of the data:
dat <- structure(list(date = c("2012-07-02", "2012-07-03", "2012-07-05",
"2012-07-06", "2012-07-09"), X10104 = c(0.003199, 0.005873, Inf,
Inf, -0.002742), X10107 = c(Inf, 0.006545, -0.001951, -0.016775,
-0.006129), X10138 = c(0.001112, 0.001428, -0.01109, -0.009612,
-0.001294), X10145 = c(-0.012178, Inf, Inf, Inf, 0.00583)), .Names = c("date",
"X10104", "X10107", "X10138", "X10145"), class = "data.frame", row.names = c(NA,
-5L))
Upvotes: 0
Reputation: 451
Obviously the issue is finding consecutive missing.
First, create a matrix TRUE/FALSE
based on missing NA
. Use that matrix to compare each row to next. Keep columns in original matrix where colSums == 0
Try this:
Missing.Mat <- apply(test, 2, is.na)
Consecutive.Mat <- Missing.Mat[-nrow(Missing.Mat),] * Missing.Mat[-1,]
Keep.Cols <- colSums(Consecutive.Mat) == 0
test[,Keep.Cols]
Upvotes: 1