MysJ
MysJ

Reputation: 3

Selecting first and last value based on binary column value

I am using R and I have a dataframe with over 23,000 entries like so:

 df[1:10,10:13]
   Start   End Distance      under50
1    248  1453        3            0
2   1450  2115      -30            0
3   2145  3254      -68            1
4   3322  5643       -3            0
5   5646  7451       -3            0
6   7454  8641      -14            0
7   8655  9860     -120            1
8   9980 11413      -65            1
9  11478 12614     -364            1
10 12978 13187     -275            1

What I am trying to do is based on the value in df[,4] starting from 0 and ending on 1, select the first number from df[,1] and the number in df[,2] with the column ending on 1.

So for example for this data row #1 starts with 0, row 2 is also 0, and row 3 is 1. So I would like to take 248 from row 1 and 3254 from row 3 then put these in a new dataframe with the column names Start and End.

Then I would like this to loop to find every instance within the dataframe. So for this example it would also pull the values for row start and row 7 end.

So in the end I would like an extracted dataframe that looks like this:

  Start   End      
1    248  3254                 
2   3322  9860
3   ####  ####

I attempted to write a loop that selected start if df[,4]==0 and end based on df[,4]==1 but it was selecting every start with df[,4]==0

I appreciate any help I can get!

Upvotes: 0

Views: 105

Answers (1)

Marius
Marius

Reputation: 60150

Something like this should work:

# The under50 flag is 1 and has just gone up by 1: end of a group
df$group_end = (df$under50 == 1) & (c(0, diff(df$under50)) == 1)
# The under50 flag is 0 and has just gone down by 1 
#   (always true for first row): start of a group
df$group_start = c(-1, diff(df$under50)) == -1

starts = df$Start[df$group_start]
ends = df$End[df$group_end]

You can look at the group_end and group_start columns to check how it's working. Here's the output from your example dataset:

> starts
[1]  248 3322
> ends
[1] 3254 9860

You may have to modify this slightly depending on how your actual dataset looks towards the final rows, it may identify an extra start without an end.

Upvotes: 2

Related Questions