mEvans
mEvans

Reputation: 905

R: subset from a matrix only those rows with a certain value in a certain column

I have a large matrix "dt" of emergency department visits over 2 months for a set of diagnosis codes. The columns are "age", "sex", "date", "county", "zip", "subjectid", "position", "diag", and "dt"; the dimensions are 872344 by 9.

I want to subset from this matrix and make a new matrix containing only those rows for which the "diag" column has a number between 800 and 849 (all columns).

I have been messing with building a loop and using "which" or "if.else" but I'm running into a mental block. It seems it would be easier if it was just ONE diag code that I wanted to pull out, but the series of 50 codes complicates things... pointing to a loop? Does anyone have ideas for how to subset based on finding certain values?

Here's my start (it didn't work):

dta = dt
b = 800:849
for (i in 1:length(b)) {

}

Upvotes: 2

Views: 13341

Answers (4)

Adam Erickson
Adam Erickson

Reputation: 6363

I would not convert the matrix() to a data.frame() as it is slower and incurs greater memory usage, while matrix() operations are generally faster anyway.

In addition to David's answer using column number indexing:

dta = dt[dt[,8] >= 800 & dt[,8] <= 849,]

There is also the form using column name indexing with a matrix:

dta = dt[dt[,'metric'] >= 800 & dt[,'metric'] <= 849,]

As shown by the microbenchmark package command for an identical matrix with 12 columns and 13,241 rows, run with R compiled with Intel MKL optimization:

microbenchmark::microbenchmark(
     test.matrix     = mt[mt[,3] %in% 5:10 & mt[,5] == 1,],
     test.data.frame = df[df[,3] %in% 5:10 & df[,5] == 1,],
     times = 1000
     )

Unit: microseconds
            expr      min       lq     mean  median        uq        max neval
 test.matrix      885.732  938.386 1154.898  943.74  952.4415 138215.318  1000
 test.data.frame 1176.218 1245.826 1363.379 1258.32 1286.4320 3392.556    1000

When the matrices get very large, this difference becomes tangible. On my machine, matrix indexing speeds outperform those of data.table as well.

Upvotes: 0

discipulus
discipulus

Reputation: 2715

In addition to excellent answers above, I can add filter function in dpylr package

filter(dt,diag>=800 & diag <= 849)

filter() is similar to subset() except that you can give it any number of filtering conditions, which are joined together with & (not && which is easy to do accidentally!). dpylr package also has other nice data manipulating functions which you can have a look.

Upvotes: 0

David Robinson
David Robinson

Reputation: 78590

dta = dt[dt[, 8] >= 800 & dt[, 8] <= 849, ]

ETA: Are you sure this is a matrix and not a data.frame? If it is a data.frame, you can do:

dta = dt[dt$diag >= 800 & dt$diag <= 849, ]

Upvotes: 6

flodel
flodel

Reputation: 89057

Given your column names, I suspect your dt is a data.frame, not a matrix; something you can confirm by running is.data.frame(dt).

If it is the case, an easy way to filter your data is to use the subset function as follows:

dta <- subset(dt, diag >= 800 & diag <= 849)

Upvotes: 5

Related Questions