useR

Reputation: 101

Subset specific row and last row from data frame

I have a data frame that records the score at successive events within a game. One game (ID) can have several scoring events. I would like to subset the rows where the score goes above 5 or below -5, and also keep the last row for each ID. So for each ID I would get one or more rows, depending on whether the score ever goes above 5 or below -5. My actual data set contains many other columns, but if I learn how to do this I'll be able to apply it to anything else I may want to do.

Here is a data set

ID Score Time
1    0    0
1    3    5
1    -2   9
1    -4   17
1    -7   31
1    -1   43
2    0    0
2    -3   15
2    0    19
2    4    25
2    6    29
2    9    33
2    3    37
3    0    0
3    5    3
3    2    11

So for this data set, I would hopefully get this output:

ID Score Time
1   -7    31    
1   -1    43
2    6    29 
2    9    33
2    3    37
3    2    11

So at the very least, each ID will have one row printed with its last score, regardless of whether the score goes above 5 or below -5 during the event (this is the case for ID 3).

My attempt can subset the rows where the value goes above 5 or below -5; I just don't know how to write code to get the last row for each ID:

Data[Data$Score > 5 | Data$Score < -5, ]

Let me know if you need any more information.

Upvotes: 2

Views: 3566

Answers (4)

Joe

Reputation: 8611

Here's a tidyverse solution. Not as concise as some of the other answers, but easier to follow.

library(tidyverse)
lastrows  <- Data %>% group_by(ID) %>% top_n(1, Time)
scorerows <- Data %>% group_by(ID) %>% filter(!between(Score, -5, 5))
bind_rows(scorerows, lastrows) %>% arrange(ID, Time) %>% unique()

# A tibble: 6 x 3
# Groups:   ID [3]
#      ID Score  Time
#   <int> <int> <int>
# 1     1    -7    31
# 2     1    -1    43
# 3     2     6    29
# 4     2     9    33
# 5     2     3    37
# 6     3     2    11

Upvotes: 0

Rich Scriven

Reputation: 99331

Here's a go at it in data.table, where df is your original data frame.

library(data.table)
setDT(df)

df[df[, c(.I[!between(Score, -5, 5)], .I[.N]), by = ID]$V1]
#    ID Score Time
# 1:  1    -7   31
# 2:  1    -1   43
# 3:  2     6   29
# 4:  2     9   33
# 5:  2     3   37
# 6:  3     2   11

We are grouping by ID. The between function flags the values from -5 to 5 (inclusive), and we negate that to get our desired values outside that range. We then use a .I subset to get the row indices of those values within each group. Then .I[.N] gives us the row number of the last entry, per group. We use the V1 column of that result as our row subset for the entire table. If the last row of a group is also outside the range, it is selected twice, so wrap the indices in unique if unique rows are desired.

Note: .I[c(which(!between(Score, -5, 5)), .N)] could also be used in the j entry of the first operation. Not sure if it's more or less efficient.

Addition: Another method, one that uses only logical values and will never produce duplicate rows in the output, is

df[df[, .I == .I[.N] | !between(Score, -5, 5), by = ID]$V1]
#    ID Score Time
# 1:  1    -7   31
# 2:  1    -1   43
# 3:  2     6   29
# 4:  2     9   33
# 5:  2     3   37
# 6:  3     2   11
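
To see the difference between the two methods, here is a small sketch on made-up data (not the question's) in which the last row of ID 1 is also outside the range. The table dt and the names idx_dup and keep are illustrative only, and the logical version assumes the rows of each ID are contiguous.

```r
library(data.table)

# Hypothetical data: the last row of ID 1 (Score 8) is also outside [-5, 5].
dt <- data.table(
  ID    = c(1L, 1L, 1L, 2L, 2L),
  Score = c(0L, 2L, 8L, -6L, 1L),
  Time  = c(0L, 4L, 9L, 0L, 7L)
)

# Index-based version: row 3 qualifies on both counts, so it appears twice.
idx_dup <- dt[, c(.I[!between(Score, -5, 5)], .I[.N]), by = ID]$V1
idx_dup
# 3 3 4 5

# unique() on the indices removes the repeat before subsetting.
dt[unique(idx_dup)]

# Logical version: one TRUE/FALSE per row, so a row cannot be picked twice.
keep <- dt[, .I == .I[.N] | !between(Score, -5, 5), by = ID]$V1
dt[keep]
```

Both dt[unique(idx_dup)] and dt[keep] return rows 3, 4 and 5 here.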

Upvotes: 3

blakeoft

Reputation: 2400

You can use rle to grab the last row for each ID. Check out ?rle for more information about this useful function.

Data2 <- Data[cumsum(rle(Data$ID)$lengths), ]
Data2
#   ID Score Time
#6   1    -1   43
#13  2     3   37
#16  3     2   11

To combine the two conditions, use rbind.

Data2 <- rbind(Data[Data$Score > 5 | Data$Score < -5, ], Data[cumsum(rle(Data$ID)$lengths), ])

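To get rid of rows that satisfy both conditions, use duplicated on the rows themselves. Checking rownames does not work here, because rbind renames repeated row names to keep them unique (for example, a second "6" becomes "61"). This assumes no two genuinely different rows have identical values in every column.

Data2 <- Data2[!duplicated(Data2), ]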

You can also sort if desired, of course.
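
A possible variation on the steps above, sketched under the assumption that the rows for each ID are contiguous: combine the row indices from both conditions before subsetting, so that no row is picked twice and the original order is kept. The names outside, last and keep are illustrative, and Data is rebuilt from the question's table.

```r
# Rebuild the question's data set.
Data <- read.table(header = TRUE, text = "
ID Score Time
1  0   0
1  3   5
1 -2   9
1 -4  17
1 -7  31
1 -1  43
2  0   0
2 -3  15
2  0  19
2  4  25
2  6  29
2  9  33
2  3  37
3  0   0
3  5   3
3  2  11
")

# Row numbers where the score is above 5 or below -5.
outside <- which(Data$Score > 5 | Data$Score < -5)

# Row number of the last row in each run of identical IDs.
last <- cumsum(rle(Data$ID)$lengths)

# union() drops repeated indices; sort() restores the original row order.
keep <- sort(union(outside, last))
Data[keep, ]
```

On the question's data this selects rows 5, 6, 11, 12, 13 and 16, which matches the desired output.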

Upvotes: 3

lmo

Reputation: 38500

Here is another base R solution.

df[as.logical(ave(df$Score, df$ID,
                  FUN=function(i) abs(i) > 5 | seq_along(i) == length(i))), ]

   ID Score Time
5   1    -7   31
6   1    -1   43
11  2     6   29
12  2     9   33
13  2     3   37
16  3     2   11

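abs(i) > 5 | seq_along(i) == length(i) constructs a logical vector that is TRUE for each element that either lies outside the range -5 to 5 or is the last element of its group. ave applies this function to the scores of each ID. Because ave returns a vector of the same type as df$Score, the result comes back as 0/1, and as.logical converts it to the logical vector used to select the rows of the data.frame.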
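
To make the per-group step concrete, here is a small sketch that gives the anonymous function a name (keep_row, chosen only for illustration) and runs it on the scores of IDs 1 and 3 from the question.

```r
# Same expression as in the answer, wrapped in a named function.
keep_row <- function(i) abs(i) > 5 | seq_along(i) == length(i)

# Scores for ID 1: only -7 is outside the range, and -1 is the last element.
scores_id1 <- c(0, 3, -2, -4, -7, -1)
keep_row(scores_id1)
# FALSE FALSE FALSE FALSE  TRUE  TRUE

# Scores for ID 3: 5 is not strictly above 5, so only the last element is kept.
scores_id3 <- c(0, 5, 2)

# ave() applies keep_row within each ID and returns numeric 0/1 values.
flags <- ave(c(scores_id1, scores_id3), c(rep(1, 6), rep(3, 3)), FUN = keep_row)
flags
# 0 0 0 0 1 1 0 0 1

as.logical(flags)
```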

Upvotes: 2
