How to speed up this code deleting certain rows?

Question

I have a data.frame where I want to delete rows for which the 5th column entries are equal to zero.

The data.frame looks like this:

Column1 Column2 Column3 Column4 Column5 Column6
1       A         3       2       1       1
2       D         2       2       4       1
3       D         4       1       0       2
4       E         4       1       0       2
5       F         2       1       A       3

So in this case the 3rd and 4th rows should be deleted. My data frame is called dataframe, and currently I use the following code:

for(i in 1:length(dataframe[,1])){ 
  if (dataframe[i,5]==0) {
    dataframe2<-dataframe[-i,] 
  } 
}

The problem is that I have 162000 entries and my code takes a long time. So how can I get a fast implementation of this?

Ben Bolker · Accepted Answer

I think:

dataframe2 <- dataframe[dataframe[,5]!=0,]

or

dataframe2 <- dataframe[dataframe[,"Column5"]!=0,]

or

dataframe2 <- subset(dataframe, Column5 != 0)

As @dickoa suggests you can also index with $:

dataframe2 <- dataframe[dataframe$Column5 != 0,]
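As a rough sketch, the example table from the question can be rebuilt and filtered with each of the forms above. Two assumptions here are not in the question: Column5 is stored as character (because of the "A" entry), and stringsAsFactors = FALSE stands in for however the real data was read in.

```r
# Rebuild the example from the question.
# Assumption: Column5 is character because one entry is "A";
# the real data may be stored differently.
dataframe <- data.frame(
  Column1 = 1:5,
  Column2 = c("A", "D", "D", "E", "F"),
  Column3 = c(3, 2, 4, 4, 2),
  Column4 = c(2, 2, 1, 1, 1),
  Column5 = c("1", "4", "0", "0", "A"),
  Column6 = c(1, 1, 2, 2, 3),
  stringsAsFactors = FALSE
)

# A single vectorised comparison builds a logical index for all rows at once,
# so no explicit loop over rows is needed.
keep <- dataframe$Column5 != 0

by_position <- dataframe[dataframe[, 5] != 0, ]
by_name     <- dataframe[dataframe[, "Column5"] != 0, ]
by_subset   <- subset(dataframe, Column5 != 0)
by_dollar   <- dataframe[keep, ]

# Each result keeps the rows whose Column5 entry is not zero.
print(by_dollar)
```

All four results keep the same rows; the forms differ only in how the column is addressed.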

In general:

indexing by column name is more robust and often more readable than indexing by number (although in your example the column names aren't meaningful)
indexing with [[]] or [,] is slightly more robust and general than indexing with $ (for example, you can use variable names constructed on the fly or numeric indices in [[]], and only exact names with $)
subset is the most readable, but less robust in some contexts
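The point about [[]] versus $ can be illustrated with a small sketch. The toy data frame and the variable names k and target are hypothetical and used only for illustration.

```r
# Hypothetical small data frame, only for illustration.
dataframe <- data.frame(Column1 = 1:4, Column5 = c(1, 0, 3, 0))

# Column name constructed on the fly
# (assumption: the names follow the pattern "Column<k>").
k <- 5
target <- paste0("Column", k)

# [[ ]] accepts the constructed name ...
by_name <- dataframe[dataframe[[target]] != 0, ]

# ... and also a numeric position (Column5 is the 2nd column of this toy frame).
by_position <- dataframe[dataframe[[2]] != 0, ]

# $ takes its argument literally: it looks for a column called "target",
# which does not exist here, so the result is NULL.
print(is.null(dataframe$target))
```

This is why [[]] tends to be the safer choice inside functions, where the column to filter on is usually passed in as a variable.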

For a problem of the size you're describing all of these approaches should be more or less instantaneous/indistinguishable in terms of speed.
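To check this on your own machine, something like the following sketch could be used. The data are simulated (an assumption, since the real data are not shown), and the timings will vary by machine, so no expected numbers are given.

```r
# Simulated data with the row count mentioned in the question.
# Assumption: Column5 holds small integers; the real column may differ.
set.seed(1)
n <- 162000
big <- data.frame(
  Column1 = seq_len(n),
  Column5 = sample(0:4, n, replace = TRUE)
)

# Time the vectorised logical-index approach.
timing_index <- system.time(
  big2 <- big[big$Column5 != 0, ]
)

# Time the subset() approach on the same data.
timing_subset <- system.time(
  big3 <- subset(big, Column5 != 0)
)

print(timing_index)
print(timing_subset)
```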
