Filtering Nested Data With Differing Row Lengths in R

Question

I have a seemingly simple problem that I can't figure out. I have a dataset with fish counts via GoPros and am having some trouble subsetting using dplyr in R. The videos are nested into sets and each video set has a differing number of videos in it. I'm trying to subset using filter to only get the observations from the first 20 videos. The issue is that the number of rows for each unique video differ with the number of fish observed. Each row is a single observation (a count of 1 or 0), so if there were no fish in the video or only a single fish, then only 1 row is associated with unique video. But if there is a school of fish then there could be upwards of 100 rows associated with a single unique video. Because of this, I can't sample based on row length or location. I suppose I could do this by hand, but I'm assuming there's a more elegant solution that I'm just missing. Any help would be greatly appreciated!

Note: I have 1 video set with only 17 videos that I would like to keep, but I could always separate that out and rebind if that throws a wrench into the code. I think I've simulated this correctly in the sample data with video set 2, which is below the threshold for the number of videos I want.

#create simulated data
data <- data.frame(Site= c('A', 'A', 'A', 'A', 'A', 'A','B', 'C', 'C', 'C', 'C', 'C'),
                             Video_Set=c(1,1,1,1,1,1,2,3,3,3,3, 3), #Unique ID for each video set
                             Video_Unique=c(1,2,3,3,3,3,4,5,6,6,7,8), #Unique ID for each video
                             Fish_Present=c('Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes','No', 'Yes', 'Yes', 'Yes', 'Yes','No'))
print(data)

#Here's what I tried
sample <- data %>% 
  group_by(Video_Set) %>% 
  filter(n_distinct(Video_Unique) > 2) %>%
  # here I just created a threshold of 2 instead of 20
  ungroup()

#here's what I'm trying to get
goal <- data[c(1,2,7,8,9),]
print(goal)

Edit/clarification: Hi all, sorry for the confusion on what I want to subset. I'm hoping to get all associated observations for the first 20 unique videos in each video set. For example, if there are 100 rows associated with video 15 from video set 3, I would like all 100 rows.

zephryl · Accepted Answer

It’s not clear to me if you want to keep all rows associated with the first n videos per set (what I understood from your description), or if you want to take just the first row from the first n videos per set (implied by your goal dataset). So I’ll provide solutions for both.

To keep all rows for the first 2 videos per set, you can create a sequential video number per group using dplyr::dense_rank(), then filter to videos where this <= 2.

library(dplyr)

data %>% 
  group_by(Video_Set) %>% 
  filter(dense_rank(Video_Unique) <= 2) %>%
  ungroup()

# A tibble: 6 × 4
  Site  Video_Set Video_Unique Fish_Present
                       
1 A             1            1 Yes         
2 A             1            2 Yes         
3 B             2            4 No          
4 C             3            5 Yes         
5 C             3            6 Yes         
6 C             3            6 Yes

To keep just the first row for the first 2 videos per group, you can use two dplyr::slice() operations: the first grouped by Video_Set and Video_Unique to get just the first row for each video, and the second grouped by Video_Set only to get the first two videos per set.

data %>% 
  group_by(Video_Set, Video_Unique) %>%
  slice(1) %>%
  group_by(Video_Set) %>%
  slice(1:2) %>%
  ungroup()

# A tibble: 5 × 4
  Site  Video_Set Video_Unique Fish_Present
                       
1 A             1            1 Yes         
2 A             1            2 Yes         
3 B             2            4 No          
4 C             3            5 Yes         
5 C             3            6 Yes

Filtering Nested Data With Differing Row Lengths in R

Answers (2)

Related Questions