TiltingTrain
TiltingTrain

Reputation: 33

Remove rows in dataframe, which are not consecutive

My question is related to my other question but it's not quite the same.

Assuming I have this:

DOY  <-c (1:30) # Day of Year
rain <-c (0,0,0,0,0,0,10,0,0,20,10,5,0,0,0,0,0,0,0,5,20,45,5,2,0,0,0,10,0,0)
DF <- data.frame(DOY=DOY, rain=rain)
# Just interested in days with rain
DF <- DF[DF$rain > 0,] 
# assign ID for each rainfall event, but only if Difference in days is >4
DF$ID <-c (1,1+cumsum(diff(DF$DOY) > 4)) 

# Rerranging, so I have the highest value at top for each group
DF <- DF %>%
  group_by(ID) %>%
  arrange(desc(rain),.by_group=TRUE) 

My DF looks like this:

DOY rain    ID
10    20     1
 7    10     1
11    10     1
 9     9     1
12     5     1
22    45     2
21    20     2
28    10     2
20     5     2
23     5     2
24     2     2

But only want days which are consecutive from the day of max value. My expected output should then be:

DOY rain    ID
10    20     1
11    10     1
 9     9     1
12     5     1
22    45     2
21    20     2
20     5     2
23     5     2
24     2     2

Rows with DOY = 7 and DOY 28 should be removed in my case, all other DOYs are kept.

Upvotes: 0

Views: 435

Answers (2)

Otto K&#228;ssi
Otto K&#228;ssi

Reputation: 3083

Here's my approach:

DF %>% arrange(ID, DOY) %>% 
       group_by(ID) %>% 
       mutate(lDOY = lag(DOY)) %>% 
       slice(c(min(which(DOY - lDOY == 1) -1), which(DOY - lDOY == 1))) %>%
       select(-lDOY) %>%
       arrange(ID, -rain)

Here we first sort by day and ID, and use slice only keep observations where tge difference of adjacent DOYs is 1. The min() part in slice() is needed to retain first observation of each sequence. Finally, the output is sorted by id and rain.

Note that this works only if you have exactly one sequence for each id.

Upvotes: 1

Dan Chaltiel
Dan Chaltiel

Reputation: 8484

I think you can reuse the answer from your other question here (which was quite a neat trick BTW).

DF %>%
  group_by(ID) %>%
  arrange(DOY,.by_group=TRUE) %>%
  mutate(
    max_rain = max(rain),
    i=cumsum(c(TRUE,diff(DOY)>1))
  ) %>%
  group_by(i, ID) %>%
  filter(any(rain==max_rain))

Basically, for each ID, you save the maximal rain and you create another ordering group ID (i here). You have to arrange by DOY for this to work.

Then, you just need to group by both ID and filter out all groups that do not contain the max rain value.

Upvotes: 1

Related Questions