MattB

Reputation: 31

Searching a matrix for only certain records

Let me start by saying I am rather new to R and generally consider myself to be a novice programmer...so don't assume I know what I'm doing :)

I have a large matrix, approximately 300,000 x 14. It's essentially a 20-year dataset of 15-minute data. However, I only need the rows where the column I've named REC.TYPE contains the string "SAO " or "FL-15".

My horribly inefficient solution was to search the matrix row by row, test the REC.TYPE column, and delete the row if it did not match my criteria. Essentially...

   j <- 1
   for (i in 1:nrow(dataset)) {
      if(dataset$REC.TYPE[j] != "SAO  " && dataset$REC.TYPE[j] != "FL-15") {
        dataset <- dataset[-j,]  }
      else {
        j <- j+1  }
   }

After watching my code get through only about 10% of the matrix in an hour and slowing with every row...I figure there must be a more efficient way of pulling out only the records I need...especially when I need to repeat this for another 8 datasets.

Can anyone point me in the right direction?

Upvotes: 3

Views: 1964

Answers (3)

Roman Luštrik

Reputation: 70653

You want regular expressions. They are case sensitive (as demonstrated below).

x <- c("ABC", "omgSAOinside", "TRALAsaoLA", "tumtiFL-15", "fl-15", "SAOFL-15")
grepl("SAO|FL-15", x)
[1] FALSE  TRUE FALSE  TRUE FALSE  TRUE

In your case, I would do

subsao <- grepl("SAO", x = dataset$REC.TYPE)
subfl <- grepl("FL-15", x = dataset$REC.TYPE)
# mysubset <- subsao & subfl # TRUE only if SAO and FL-15 both occur in the same row
mysubset <- subsao | subfl # TRUE if either one occurs in the row
dataset[mysubset, ]
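
The two tests can also be folded into a single pattern. The following is only a sketch on made-up data: the toy `dataset`, its `TEMP` column, and the idea that the real values are padded with trailing spaces (as in the question's "SAO  ") are all assumptions. Anchoring the pattern with `^` and `$` keeps a value such as "SAOAU" from matching, which an unanchored "SAO" would accept.

```r
# Toy stand-in for the real data; everything except the REC.TYPE name is made up.
dataset <- data.frame(
  REC.TYPE = c("SAO  ", "FL-15", "AUTO ", "SAOAU", "FL-15"),
  TEMP     = c(10.1, 11.3, 9.8, 12.0, 10.7),
  stringsAsFactors = FALSE
)

# One pattern for both record types. ^ and $ anchor the match to the whole
# field, and [[:space:]]* tolerates trailing padding such as "SAO  ".
keep <- grepl("^(SAO|FL-15)[[:space:]]*$", dataset$REC.TYPE)

# Logical vector in the row position keeps only the matching rows.
mysubset <- dataset[keep, ]
print(mysubset)
```

Whether anchoring is wanted depends on what the REC.TYPE field really contains, so check a few values with `unique(dataset$REC.TYPE)` first.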

Upvotes: 4

Blue Magister

Reputation: 13363

As other posters have said, repeating the subset [ operation is slow. Instead, functions that operate over the entire vector are preferable.

I assume that both your criteria affect REC.TYPE. My solution uses the function %in%:

dataset <- dataset[dataset$REC.TYPE %in% c("SAO","FL-15"),]
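
One caveat: %in% compares whole strings exactly, so if the stored values are padded the way the question's "SAO  " suggests, a bare "SAO" will not match. A minimal sketch, assuming that padding is really present and using a made-up toy data frame (the `VALUE` column is hypothetical):

```r
# Toy data; the padded "SAO  " mirrors the string used in the question.
dataset <- data.frame(
  REC.TYPE = c("SAO  ", "FL-15", "AUTO ", "SAO  "),
  VALUE    = 1:4,
  stringsAsFactors = FALSE
)

# Exact matching: "SAO" is not equal to "SAO  ", so only the FL-15 row matches.
exact <- dataset$REC.TYPE %in% c("SAO", "FL-15")

# Strip surrounding whitespace before comparing (trimws() needs R >= 3.2.0).
trimmed <- trimws(dataset$REC.TYPE) %in% c("SAO", "FL-15")

subset_rows <- dataset[trimmed, ]
print(subset_rows)
```

If the values are not padded, the original one-liner is all that is needed.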

Upvotes: 3

Wilduck

Reputation: 14136

I couldn't tell from the code you posted but if your data is already in a data.frame, you can do this directly. If not, first run dataset <- data.frame(dataset).

From there:

dataset[dataset$REC.TYPE == "SAO  " | dataset$REC.TYPE == "FL-15",]

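should return what you're looking for. Loops like the one in the question are horribly inefficient in R, largely because dataset[-j,] builds a new copy of the whole data frame on every iteration. Once you've read through the R tutorial, The R Inferno will tell you how to avoid some common pitfalls.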

The way this particular line works is to filter the data frame, returning only the rows that match the criteria. You can type ?"[" into your R interpreter for more information.
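
Since the question mentions eight more datasets, the same filter could be wrapped in a function and applied to a list of data frames. This is only a sketch: `keep_rec_types` is a hypothetical helper name, and the two small data frames (with a made-up `X` column) stand in for the real data.

```r
# Hypothetical helper: keep only the rows whose REC.TYPE is "SAO  " or "FL-15".
keep_rec_types <- function(df) {
  # The comparison yields one TRUE/FALSE per row; `[` keeps the TRUE rows.
  wanted <- df$REC.TYPE == "SAO  " | df$REC.TYPE == "FL-15"
  df[wanted, , drop = FALSE]
}

# Two made-up datasets standing in for the real ones.
datasets <- list(
  a = data.frame(REC.TYPE = c("SAO  ", "AUTO ", "FL-15"), X = 1:3,
                 stringsAsFactors = FALSE),
  b = data.frame(REC.TYPE = c("AUTO ", "AUTO ", "SAO  "), X = 4:6,
                 stringsAsFactors = FALSE)
)

# Apply the same filter to every dataset in the list.
filtered <- lapply(datasets, keep_rec_types)
print(sapply(filtered, nrow))
```

Each element of `filtered` is a data frame containing only the wanted rows, so the loop over rows disappears entirely.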

Upvotes: 4
