Reputation: 905
I have a large matrix "dt" of emergency department visits over 2 months for a set of diagnosis codes. The columns are "age", "sex", "date", "county", "zip", "subjectid", "position", "diag", and "dt"; the dimensions are 872344 by 9.
I want to subset from this matrix and make a new matrix containing only those rows for which the "diag" column has a number between 800 and 849 (all columns).
I have been messing with building a loop and using "which" or "if.else" but I'm running into a mental block. It seems it would be easier if it was just ONE diag code that I wanted to pull out, but the series of 50 codes complicates things... pointing to a loop? Does anyone have ideas for how to subset based on finding certain values?
Here's my start (it didn't work):
dta = dt
b = 800:849
for (i in 1:length(b)) {
}
Upvotes: 2
Views: 13341
Reputation: 6363
I would not convert the matrix()
to a data.frame()
as it is slower and incurs greater memory usage, while matrix()
operations are generally faster anyway.
In addition to David's answer using column number indexing:
dta = dt[dt[,8] >= 800 & dt[,8] <= 849,]
There is also the form using column name indexing with a matrix:
dta = dt[dt[,'metric'] >= 800 & dt[,'metric'] <= 849,]
As shown by the microbenchmark
package command for an identical matrix with 12 columns and 13,241 rows, run with R compiled with Intel MKL optimization:
microbenchmark::microbenchmark(
test.matrix = mt[mt[,3] %in% 5:10 & mt[,5] == 1,],
test.data.frame = df[df[,3] %in% 5:10 & df[,5] == 1,],
times = 1000
)
Unit: microseconds
expr min lq mean median uq max neval
test.matrix 885.732 938.386 1154.898 943.74 952.4415 138215.318 1000
test.data.frame 1176.218 1245.826 1363.379 1258.32 1286.4320 3392.556 1000
When the matrices get very large, this difference becomes tangible. On my machine, matrix indexing speeds outperform those of data.table
as well.
Upvotes: 0
Reputation: 2715
In addition to excellent answers above, I can add filter
function in dpylr
package
filter(dt,diag>=800 & diag <= 849)
filter()
is similar to subset()
except that you can give it any number of filtering conditions, which are joined together with &
(not &&
which is easy to do accidentally!). dpylr
package also has other nice data manipulating functions which you can have a look.
Upvotes: 0
Reputation: 78590
dta = dt[dt[, 8] >= 800 & dt[, 8] <= 849, ]
ETA: Are you sure this is a matrix and not a data.frame? If it is a data.frame, you can do:
dta = dt[dt$diag >= 800 & dt$diag <= 849, ]
Upvotes: 6
Reputation: 89057
Given your column names, I suspect your dt
is a data.frame, not a matrix; something you can confirm by running is.data.frame(dt)
.
If it is the case, an easy way to filter your data is to use the subset
function as follows:
dta <- subset(dt, diag >= 800 & diag <= 849)
Upvotes: 5