Mircea_cel_Batran
Mircea_cel_Batran

Reputation: 105

Subsetting last row from each ID group in a long data set in R

I have a longitudinal data set of 142,415 rows and 965 columns. For each ID in the data set, there are multiple rows, not necessarily the same number of rows for each ID.

I would like to get the last row (data is already sorted) for each ID and created a data frame of just those, keeping all the remaining 964 columns of data.

When I look at previous questions addressing this, a lot of the suggestions use aggregate() and I can't use that (at least from what I know) because I have too many columns.

I did try the following but that's tripped up my computer so I'm wondering if there's a faster way to do this than making a list and then forming a data frame from it:

data.list<-by(data.in, data.in$ID, tail, n=1)
data.new<-do.call("rbind", as.list(data.list))

Upvotes: 0

Views: 749

Answers (3)

akrun
akrun

Reputation: 887048

We can also use filter

library(dplyr)
iris %>%
    group_by(Species) %>%
    filter(row_number()== n())

Upvotes: 0

The answer above probably works better since your data is ordered. This method also works if the data is not ordered.

I presume that there is a single column that you want to use to select the row (e.g. date).

You could solve it by indexing the large dataset and then selecting the id and date row along with the index and then use summarise to find the right index and filter the larger dataset. That way you don't have to swing around all those hundreds of columns.

library(tidyverse)
#Random dataset, date is coded as numeric for practical purposes. 
#var1 and var2 pose as the >900 columns you have
large_dataset <- tibble(id=c(1,1,1,2,2,2,2,3,3,4),
                        date=c(10,15,20,10,15,20,25,10,15,10),
                        var1=rpois(10,50),
                        var2=rpois(10,30))

large_dataset

# A tibble: 10 x 4
  id  date  var1  var2
   <dbl> <dbl> <int> <int>
 1     1    10    41    39
 2     1    15    56    32
 3     1    20    41    34
 4     2    10    46    35
 5     2    15    44    38
 6     2    20    50    33
 7     2    25    58    34
 8     3    10    44    28
 9     3    15    49    37
10     4    10    62    38


large_dataset <- mutate(large_dataset,index = c(1:nrow(large_dataset)))

key <- select(large_dataset,id,date,index) 

key <- group_by(key,id) %>%
summarise(date=max(date)) %>%
left_join(key,by=c("id","date"))

small_dataset <- filter(large_dataset,index %in% key$index) %>% 
select(-index) #Removing the index variable since it has served its purpose


id  date  var1  var2
<dbl> <dbl> <int> <int>

1     1    20    41    34
2     2    25    58    34
3     3    15    49    37
4     4    10    62    38

Upvotes: 0

Matt
Matt

Reputation: 2987

I don't have your data, but using dplyr you could do:

library(dplyr) 

iris %>% 
group_by(Species) %>%
slice(n())

# Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
#         <dbl>       <dbl>        <dbl>       <dbl> <fct>     
#1          5           3.3          1.4         0.2 setosa    
#2          5.7         2.8          4.1         1.3 versicolor
#3          5.9         3            5.1         1.8 virginica 

Or, you could use data.table

library(data.table)
setDT(iris)[, .SD[.N], by=Species]

Upvotes: 2

Related Questions