Effecient way for filtering dataframe by column value

Question

I have the following dataframe (DF1), which contains much more rows (this is only the head of the dataframe):

"transactionID" "sequenceID" "eventID" "items"
"1" 29698 40928360 1 "OV4"
"2" 23561 41469527 1 "OV6"
"3" 23562 41469527 2 "OV7"
"4" 10514 41487127 1 "OV8"
"5" 10515 41487127 2 "OV9"
"6" 10516 41487127 3 "OV10"
"7" 10517 41487127 4 "OV11"

I would like to filter the all the rows which their maximal eventID value is over 3. As seen, each eventId is grouped by sequenceID. I would like to keep the order and the content of the rows. The required result would be:

"transactionID" "sequenceID" "eventID" "items"
"4" 10514 41487127 1 "OV8"
"5" 10515 41487127 2 "OV9"
"6" 10516 41487127 3 "OV10"
"7" 10517 41487127 4 "OV11"

Since, only sequenceID = 41487127 contains eventID of 4. The rest contains only 1 or 2 eventID. I used the following command:

> data7<-setDT(DF1)[, if(any(eventID >= 3)) .SD, by = sequenceID ]

> data7
   sequenceID transactionID eventID items
1:   41487127         10514       1   OV8
2:   41487127         10515       2   OV9
3:   41487127         10516       3  OV10
4:   41487127         10517       4  OV11

The position of the column transactionID was changed. Why is it? What can I do just to filter the dataframe without changing the column's positions? Is my command is the most efficient way for such filtering?

akrun · Accepted Answer

We are grouping by 'sequenceID', which becomes the first column after filtering. But, we can rearrange using the original column names of the dataset.

setDT(DF1)[,if(any(eventID>3)) .SD , sequenceID][,names(DF1),with=FALSE]
#   transactionID sequenceID eventID items
#1:         10514   41487127       1   OV8
#2:         10515   41487127       2   OV9
#3:         10516   41487127       3  OV10
#4:         10517   41487127       4  OV11

Suppose, we need the position of the rows that were filtered, we create a row names column ('rn') using keep.rownames and then do the filter.

setDT(DF1, keep.rownames=TRUE)[, if(any(eventID>3)) .SD , sequenceID
                               ][, names(DF1), with=FALSE]
#   rn transactionID sequenceID eventID items
#1:  4         10514   41487127       1   OV8
#2:  5         10515   41487127       2   OV9
#3:  6         10516   41487127       3  OV10
#4:  7         10517   41487127       4  OV11

Effecient way for filtering dataframe by column value

Answers (1)

Related Questions