Reputation: 31
Let me start by saying I am rather new to R and generally consider myself to be a novice programmer...so don't assume I know what I'm doing :)
I have a large matrix, approximately 300,000 x 14. It's essentially a 20-year dataset of 15-minute data. However, I only need the rows where the column I've named REC.TYPE contains the string "SAO " or "FL-15".
My horribly inefficient solution was to search the matrix row by row, test the REC.TYPE column, and delete the row if it did not match my criteria. Essentially:
j <- 1
for (i in 1:nrow(dataset)) {
  if (dataset$REC.TYPE[j] != "SAO " && dataset$REC.TYPE[j] != "FL-15") {
    dataset <- dataset[-j, ]  # drop the row; j now points at the next row
  } else {
    j <- j + 1                # keep the row and move on
  }
}
After watching my code get through only about 10% of the matrix in an hour and slowing with every row...I figure there must be a more efficient way of pulling out only the records I need...especially when I need to repeat this for another 8 datasets.
Can anyone point me in the right direction?
Upvotes: 3
Views: 1964
Reputation: 70653
You want regular expressions. They are case sensitive (as demonstrated below).
x <- c("ABC", "omgSAOinside", "TRALAsaoLA", "tumtiFL-15", "fl-15", "SAOFL-15")
grepl("SAO|FL-15", x)
[1] FALSE TRUE FALSE TRUE FALSE TRUE
In your case, I would do
subsao <- grepl("SAO", x = dataset$REC.TYPE)
subfl <- grepl("FL-15", x = dataset$REC.TYPE)
#mysubset <- subsao & subfl # will return TRUE only if SAO & FL-15 occur in the same line
mysubset <- subsao | subfl # will return TRUE if either occurs in the same line
dataset[mysubset, ]
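To see what the mask does on a small scale, here is a minimal sketch on a made-up data frame (the values and the VALUE column are invented for illustration). It also shows an anchored pattern, in case fields that merely contain "SAO" or "FL-15" should be excluded.

```r
# Toy data standing in for the real dataset (values are invented).
dataset <- data.frame(
  REC.TYPE = c("SAO ", "FL-15", "OTHER", "SAO ", "XFL-15X"),
  VALUE    = c(1, 2, 3, 4, 5),
  stringsAsFactors = FALSE
)

# Substring match: TRUE wherever "SAO" or "FL-15" appears anywhere in the field.
loose <- grepl("SAO|FL-15", dataset$REC.TYPE)

# Anchored match: the whole field must be "SAO" (optionally followed by
# spaces) or exactly "FL-15".
strict <- grepl("^(SAO *|FL-15)$", dataset$REC.TYPE)

loose_rows  <- dataset[loose, ]   # includes the "XFL-15X" row
strict_rows <- dataset[strict, ]  # excludes it
```

Which of the two is appropriate depends on how clean the REC.TYPE field is in the real data.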
Upvotes: 4
Reputation: 13363
As other posters have said, repeating the subset operation [ is slow. Instead, functions that operate over the entire vector are preferable.
I assume that both your criteria affect REC.TYPE. My solution uses the function %in%:
# %in% matches whole strings exactly, so keep the trailing space in "SAO "
dataset <- dataset[dataset$REC.TYPE %in% c("SAO ", "FL-15"), ]
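Since %in% compares whole strings exactly, padding such as the trailing space in "SAO " matters. A small sketch on invented data, assuming the padding carries no meaning and can be stripped with trimws() before matching:

```r
# Toy data (invented); note the trailing space in "SAO ".
dataset <- data.frame(
  REC.TYPE = c("SAO ", "FL-15", "OTHER", "SAO "),
  VALUE    = 1:4,
  stringsAsFactors = FALSE
)

wanted <- c("SAO", "FL-15")

# Exact comparison: "SAO " is not equal to "SAO", so those rows are missed.
exact <- dataset$REC.TYPE %in% wanted

# Strip leading/trailing whitespace first, then compare.
trimmed <- trimws(dataset$REC.TYPE) %in% wanted

subset_rows <- dataset[trimmed, ]
```

If the padding is significant in your data, match on the padded string instead of trimming.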
Upvotes: 3
Reputation: 14136
I couldn't tell from the code you posted, but if your data is already in a data.frame, you can do this directly. If not, first run dataset <- data.frame(dataset).
From there:
dataset[dataset$REC.TYPE == "SAO " | dataset$REC.TYPE == "FL-15",]
should return what you're looking for. Row-by-row for loops like yours, which copy the data frame every time a row is dropped, are horribly inefficient in R. Once you've read through the R tutorial, The R Inferno will tell you how to avoid some common pitfalls.
The way this particular line works is to filter the data frame by returning only the rows that match the criteria. You can type ?"[" into your R interpreter for more information.
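As a rough illustration of the logical indexing described above, and of reusing it across several datasets, here is a sketch. The helper name keep_rec_types and the toy data are hypothetical, not part of the question. It uses %in% rather than ==, because == returns NA for missing values and NA in a row index yields rows of NAs.

```r
# Hypothetical helper: keep only rows whose REC.TYPE is one of `types`.
keep_rec_types <- function(df, types = c("SAO ", "FL-15")) {
  keep <- df$REC.TYPE %in% types   # one TRUE/FALSE per row, never NA
  df[keep, , drop = FALSE]         # rows where keep is TRUE, all columns
}

# Invented stand-ins for the real datasets.
datasets <- list(
  a = data.frame(REC.TYPE = c("SAO ", "OTHER"),
                 VALUE = 1:2, stringsAsFactors = FALSE),
  b = data.frame(REC.TYPE = c("FL-15", "FL-15", NA),
                 VALUE = 1:3, stringsAsFactors = FALSE)
)

# Apply the same filter to every dataset in the list.
filtered <- lapply(datasets, keep_rec_types)
row_counts <- sapply(filtered, nrow)
```

Keeping the datasets in a list means the filter is written once and applied to all of them in a single call.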
Upvotes: 4