Reputation: 105
I have a longitudinal data set of 142,415 rows and 965 columns. For each ID in the data set, there are multiple rows, not necessarily the same number of rows for each ID.
I would like to get the last row (data is already sorted) for each ID and created a data frame of just those, keeping all the remaining 964 columns of data.
When I look at previous questions addressing this, a lot of the suggestions use aggregate()
and I can't use that (at least from what I know) because I have too many columns.
I did try the following but that's tripped up my computer so I'm wondering if there's a faster way to do this than making a list and then forming a data frame from it:
data.list<-by(data.in, data.in$ID, tail, n=1)
data.new<-do.call("rbind", as.list(data.list))
Upvotes: 0
Views: 749
Reputation: 887048
We can also use filter
library(dplyr)
iris %>%
group_by(Species) %>%
filter(row_number()== n())
Upvotes: 0
Reputation: 161
The answer above probably works better since your data is ordered. This method also works if the data is not ordered.
I presume that there is a single column that you want to use to select the row (e.g. date).
You could solve it by indexing the large dataset and then selecting the id and date row along with the index and then use summarise to find the right index and filter the larger dataset. That way you don't have to swing around all those hundreds of columns.
library(tidyverse)
#Random dataset, date is coded as numeric for practical purposes.
#var1 and var2 pose as the >900 columns you have
large_dataset <- tibble(id=c(1,1,1,2,2,2,2,3,3,4),
date=c(10,15,20,10,15,20,25,10,15,10),
var1=rpois(10,50),
var2=rpois(10,30))
large_dataset
# A tibble: 10 x 4
id date var1 var2
<dbl> <dbl> <int> <int>
1 1 10 41 39
2 1 15 56 32
3 1 20 41 34
4 2 10 46 35
5 2 15 44 38
6 2 20 50 33
7 2 25 58 34
8 3 10 44 28
9 3 15 49 37
10 4 10 62 38
large_dataset <- mutate(large_dataset,index = c(1:nrow(large_dataset)))
key <- select(large_dataset,id,date,index)
key <- group_by(key,id) %>%
summarise(date=max(date)) %>%
left_join(key,by=c("id","date"))
small_dataset <- filter(large_dataset,index %in% key$index) %>%
select(-index) #Removing the index variable since it has served its purpose
id date var1 var2
<dbl> <dbl> <int> <int>
1 1 20 41 34
2 2 25 58 34
3 3 15 49 37
4 4 10 62 38
Upvotes: 0
Reputation: 2987
I don't have your data, but using dplyr
you could do:
library(dplyr)
iris %>%
group_by(Species) %>%
slice(n())
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <dbl> <dbl> <dbl> <dbl> <fct>
#1 5 3.3 1.4 0.2 setosa
#2 5.7 2.8 4.1 1.3 versicolor
#3 5.9 3 5.1 1.8 virginica
Or, you could use data.table
library(data.table)
setDT(iris)[, .SD[.N], by=Species]
Upvotes: 2