user7477092
user7477092

Reputation:

How to find percentile and then group in R

I have a data frame like below (df).

day  area   hour  time  count
___  ____  _____  ___   ____
 1    1      0     1     10
 1    1      0     2     12
 1    1      0     3     8
 1    1      0     4     12    
 1    1      0     5     15  
 1    1      0     6     18 
 1    1      1     1     10
 1    1      1     2     12
 1    1      1     3     8
 1    1      1     4     12    
 1    1      1     5     15  
 1    1      1     6     18
 1    1      1     7     12    
 1    1      1     8     15  
 1    1      1     9     18
 1    1      2     1     10    
 1    1      2     2     18  
 1    1      2     3     19
 .....
 2    1      0     1     18
 2    1      0     2     12
 2    1      0     3     18
 2    1      0     4     12    
 2    1      1     1     8
 2    1      1     2     12
 2    1      1     3     18
 2    1      1     4     10    
 2    1      1     5     15  
 2    1      1     6     18
 2    1      1     7     12    
 2    1      1     8     15  
 2    1      1     9     18
 2    1      2     1     10    
 2    1      2     2     18  
 2    1      2     3     19
 2    1      2     4     9    
 2    1      2     5     18  
 2    1      2     6     9


..... 
 30    99      23     1     9    
 30    99      23     2     8  
 30    99      23     3     9
 30    99      23     4     19    
 30    99      23     5     18  
 30    99      23     6     9
 30    99      23     7     19    
 30    99      23     8     8  
 30    99      23     9     19

Here I have the data for 30 days for 87 areas (1 to 82 and then I have 90,93,95,97,99) and 24 hours (0 to 23) per day.So the data is about the time taken to cross the area and how many have crossed.

For example:

day  area   hour  time  count
___  ____  _____  ___   ____
 1    1      0     1     10
 1    1      0     2     12
 1    1      0     3     8
 1    1      0     4     12    
 1    1      0     5     15  
 1    1      0     6     18 

This gives me the On day 1 on hour 0 the time taken to cross the area 1

time  count   cumulative_count
___    ___    ________________
 1     10           10
 2     12           22
 3     8            30
 4     12           42    
 5     15           57
 6     18           75 
10 vehicles crossed the area in 1 minute.
12 vehicles crossed the area in 2 minutes.
8 vehicles crossed the area in 3 minutes.
12 vehicles crossed the area in 4 minutes.
15 vehicles crossed the area in 5 minutes.
18 vehicles crossed the area in 6 minutes.

From this I want to calculate How much time it took for 80% of the vehicles to cross area 1 in day 1 hour 0.So total vehicles=(10+12+8+12+15+18)=75.So 80% of 75 is 60.So time taken for 80% of the vehicles(80% of 75 which is 60) to pass the area 1 at day 1 hour 0 will be between 5 and 6(will be nearer to 5). So the result will be like:

 day  area   hour    time_taken_for_80%vehicles_to_pass
    ___  ____   ____    ___________________________________
     1    1      0                5.33(approximately)
     1    1      1                7.30
     1    1      2                2.16
    ....
     30   1      23               3.13
     1    2      0                ---
     1    2      1                ---
     1    2      2                ---
     1    2      3                ---

 .......

     30    99     21              ---
     30    99     22              ---
     30    99     23              ---

   I know to I have to take quantile and then group by the area and day and hour.So I tried with 

library(dplyr)
grp <- group_by(df, day,area,hour,quantile(df$count,0.8))

But it does not work.Any help is appreciated

Upvotes: 0

Views: 121

Answers (1)

Wietze314
Wietze314

Reputation: 6020

My solution calculates the percentage of vehicles that crossed the area, for each time. Then gets the first time the percentage is above 80%:

str <- 'day  area   hour  time  count
1    1      0     1     10
1    1      0     2     12
1    1      0     3     8
1    1      0     4     12    
1    1      0     5     15  
1    1      0     6     18
1    1      1     1     10
1    1      1     2     12
1    1      1     3     8
1    1      1     4     12    
1    1      1     5     15  
1    1      1     6     18
1    1      1     7     12    
1    1      1     8     15  
1    1      1     9     18
1    1      2     1     10    
1    1      2     2     18  
1    1      2     3     19'



file <- textConnection(str)
df <- read.table(file, header = T)

df

library(dplyr)
df %>% group_by(day, area, hour) %>%
  mutate(cumcount = cumsum(count),
         p = cumcount/max(cumcount)) %>%
  filter(p > 0.8) %>%
  summarise(time = min(time))

result:

    day  area  hour  time
  <int> <int> <int> <int>
1     1     1     0     6
2     1     1     1     8
3     1     1     2     3

Or with a linear estimation of the time when 80% is reached:

df %>% group_by(day, area, hour) %>%
  mutate(cumcount = cumsum(count),
         p = cumcount/max(cumcount),
         g = +(p > 0.8),
         order = (g*2-1)*time) %>%
  group_by(day, area, hour,g) %>%
  filter(row_number((g*2-1)*time)==1) %>%
  group_by(day, area, hour) %>%
  summarise(time = min(time)+(0.8-min(p))/(max(p)-min(p)))

result:

    day  area  hour     time
  <int> <int> <int>    <dbl>
1     1     1     0 5.166667
2     1     1     1 7.600000
3     1     1     2 2.505263

or get the same result using lag and lead

df %>% group_by(day, area, hour) %>%
  arrange(hour) %>%
  mutate(cumcount = cumsum(count),
         p = cumcount/max(cumcount)) %>%
  filter((p >= 0.8&lag(p)<0.8)|(p < 0.8&lead(p)>=0.8)) %>%
  summarise(time = min(time)+(0.8-min(p))/(max(p)-min(p)))

Upvotes: 1

Related Questions