Yves
Yves

Reputation: 556

How to count number of occurrences within subgroup with R?

I have the following data.frame (dput() at the end of question).

(c_arr_cords[1:20,])
   linkId       x       y vehicleRefId new_arr_time duration dep_time
1      90 2681090 1245442      1267069          0.0      6.5      6.5
2      90 2681090 1245442       532654         11.0      1.0     12.0
3      90 2681090 1245442      1398907         12.5      0.5     12.5
4      90 2681090 1245442      1267069         12.0     24.0     36.0
5      90 2681090 1245442         4205         16.5      0.0     16.5
6      90 2681090 1245442      1111105         11.0      0.0     11.5
7      90 2681090 1245442       434774         16.0      0.0     16.5
8      90 2681090 1245442      1179923          0.0     15.5     15.5
9      90 2681090 1245442        46491         14.5      0.5     15.0
10     90 2681090 1245442      1179923         16.0     19.5     36.0
11     90 2681090 1245442      1326473         11.0      3.0     13.5
12     90 2681090 1245442      1239391         13.0      0.5     13.5
13     90 2681090 1245442       810534          8.0      0.0      8.0
14     90 2681090 1245442        51825          9.5      0.5     10.0
15     90 2681090 1245442      1199672         11.0      1.0     12.0
16     90 2681090 1245442      1269433         17.5      1.5     19.0
17    389 2681367 1247844       492533         14.5      1.5     16.0
18    389 2681367 1247844      1454119         17.5     18.0     36.0
19    389 2681367 1247844      1278645          0.0      8.0      8.0
20    389 2681367 1247844      1430553         10.5      1.5     12.0

My goal is to create a data.frame where I see how many vehicles are on a linkId at any 1h time bin. If a vehicle is on a link at time x can be derived from new_arr_time (arrival) and dep_time (departure). For time = 12 (hour 12) on link 90, one has to count how many vehicles have their new_arr_time <= 12 , and dep_time >=12. In total there will be maximum 48 time bins (if 0, it is not necessary to have a bin).

The desired table should have the following structure:

linkId  time    count
90      0.0     3
90      0.5     x
90      1.0     y
...
389     0.0     z
...

My struggle is to create an efficient loop to do this operation.

Thank you in advance!

Data:

structure(list(linkId = c(90L, 90L, 90L, 90L, 90L, 90L, 90L, 
90L, 90L, 90L, 90L, 90L, 90L, 90L, 90L, 90L, 389L, 389L, 389L, 
389L, 389L, 389L, 389L, 389L, 389L, 389L, 389L, 389L, 389L, 389L, 
451L, 451L, 451L, 451L, 480L, 480L, 480L, 480L, 480L, 578L, 578L, 
578L, 578L, 578L, 578L, 578L, 662L, 662L, 662L, 662L, 662L, 662L, 
723L, 723L, 723L, 723L, 723L, 723L, 723L, 723L, 723L, 723L, 723L, 
723L, 723L, 723L, 723L, 723L, 723L, 723L, 723L, 723L, 723L, 723L, 
723L, 723L, 723L, 723L, 723L, 723L, 723L, 723L, 723L, 723L, 723L, 
723L, 723L, 723L, 723L, 723L, 723L, 723L, 723L, 723L, 723L, 723L, 
723L, 723L, 723L, 723L), x = c(2681090, 2681090, 2681090, 2681090, 
2681090, 2681090, 2681090, 2681090, 2681090, 2681090, 2681090, 
2681090, 2681090, 2681090, 2681090, 2681090, 2681366.83333333, 
2681366.83333333, 2681366.83333333, 2681366.83333333, 2681366.83333333, 
2681366.83333333, 2681366.83333333, 2681366.83333333, 2681366.83333333, 
2681366.83333333, 2681366.83333333, 2681366.83333333, 2681366.83333333, 
2681366.83333333, 2683684, 2683684, 2683684, 2683684, 2683675.34782609, 
2683675.34782609, 2683675.34782609, 2683675.34782609, 2683675.34782609, 
2676435, 2676435, 2676435, 2676435, 2676435, 2676435, 2676435, 
2682590, 2682590, 2682590, 2682590, 2682590, 2682590, 2672126, 
2672126, 2672126, 2672126, 2672126, 2672126, 2672126, 2672126, 
2672126, 2672126, 2672126, 2672126, 2672126, 2672126, 2672126, 
2672126, 2672126, 2672126, 2672126, 2672126, 2672126, 2672126, 
2672126, 2672126, 2672126, 2672126, 2672126, 2672126, 2672126, 
2672126, 2672126, 2672126, 2672126, 2672126, 2672126, 2672126, 
2672126, 2672126, 2672126, 2672126, 2672126, 2672126, 2672126, 
2672126, 2672126, 2672126, 2672126, 2672126), y = c(1245442, 
1245442, 1245442, 1245442, 1245442, 1245442, 1245442, 1245442, 
1245442, 1245442, 1245442, 1245442, 1245442, 1245442, 1245442, 
1245442, 1247843.75, 1247843.75, 1247843.75, 1247843.75, 1247843.75, 
1247843.75, 1247843.75, 1247843.75, 1247843.75, 1247843.75, 1247843.75, 
1247843.75, 1247843.75, 1247843.75, 1246790, 1246790, 1246790, 
1246790, 1246835.5, 1246835.5, 1246835.5, 1246835.5, 1246835.5, 
1241381, 1241381, 1241381, 1241381, 1241381, 1241381, 1241381, 
1237645.6, 1237645.6, 1237645.6, 1237645.6, 1237645.6, 1237645.6, 
1248783, 1248783, 1248783, 1248783, 1248783, 1248783, 1248783, 
1248783, 1248783, 1248783, 1248783, 1248783, 1248783, 1248783, 
1248783, 1248783, 1248783, 1248783, 1248783, 1248783, 1248783, 
1248783, 1248783, 1248783, 1248783, 1248783, 1248783, 1248783, 
1248783, 1248783, 1248783, 1248783, 1248783, 1248783, 1248783, 
1248783, 1248783, 1248783, 1248783, 1248783, 1248783, 1248783, 
1248783, 1248783, 1248783, 1248783, 1248783, 1248783), vehicleRefId = c(1267069L, 
532654L, 1398907L, 1267069L, 4205L, 1111105L, 434774L, 1179923L, 
46491L, 1179923L, 1326473L, 1239391L, 810534L, 51825L, 1199672L, 
1269433L, 492533L, 1454119L, 1278645L, 1430553L, 1412246L, 1533113L, 
1278645L, 1454119L, 1412246L, 1430553L, 1533113L, 1278645L, 1310775L, 
1310775L, 1161080L, 1290940L, 558745L, 628509L, 1530598L, 403850L, 
1397256L, 774916L, 3874L, 1043798L, 1881121L, 193835L, 526654L, 
2066556L, 1221124L, 12799L, 486288L, 485689L, 488147L, 485689L, 
486288L, 488147L, 2095866L, 42794L, 2149105L, 1887358L, 1902958L, 
1901830L, 1215125L, 2148165L, 1457624L, 1898426L, 1394390L, 1859644L, 
1908352L, 1885007L, 1885718L, 1887788L, 1222534L, 1888344L, 1926462L, 
1785664L, 2147547L, 1898186L, 1921295L, 1905635L, 1888247L, 1747951L, 
2149105L, 2821L, 1094609L, 1531804L, 1670344L, 1912658L, 1799420L, 
1908352L, 1925302L, 2064554L, 1887316L, 1869032L, 1925659L, 1794294L, 
1378838L, 1528492L, 4806833L, 5259385L, 1860654L, 1187619L, 1814856L, 
1863281L), new_arr_time = c(0, 11, 12.5, 12, 16.5, 11, 16, 0, 
14.5, 16, 11, 13, 8, 9.5, 11, 17.5, 14.5, 17.5, 0, 10.5, 18, 
13.5, 25, 0, 0, 12.5, 8.5, 17, 19, 0, 7.5, 7.5, 7.5, 7.5, 8.5, 
6, 13.5, 7.5, 14, 8, 10, 7.5, 18, 18, 9.5, 16, 18.5, 21, 0, 0, 
0, 18.5, 12, 19, 8, 9, 18, 14, 19, 10, 17, 12, 7, 13, 13.5, 11, 
14.5, 17, 9.5, 8.5, 8.5, 7, 6.5, 18.5, 22.5, 12.5, 18.5, 8, 14, 
6.5, 9.5, 8, 17.5, 17, 12.5, 8, 5.5, 18, 19.5, 7.5, 8.5, 13, 
18.5, 12, 15.5, 19, 20, 13, 8, 9.5), duration = c(6.5, 1, 0.5, 
24, 0, 0, 0, 15.5, 0.5, 19.5, 3, 0.5, 0, 0.5, 1, 1.5, 1.5, 18, 
8, 1.5, 17.5, 5, 11, 7, 7, 0.5, 4, 2, 16.5, 7.5, 10, 10, 10, 
9.5, 10.5, 8, 8.5, 9.5, 8, 0.5, 0.5, 3, 1, 1, 2.5, 0, 17.5, 15, 
13, 7, 8, 17.5, 1, 3.5, 4.5, 2.5, 2, 1.5, 4.5, 1, 1, 1, 10, 2, 
4, 1, 2.5, 2, 2, 1, 0.5, 10, 10.5, 5, 0, 3.5, 0, 10.5, 3, 9.5, 
1.5, 0, 3, 2.5, 0, 3, 5.5, 1.5, 1, 10, 1, 3, 0, 1, 1, 1.5, 2.5, 
1, 2.5, 0.5), dep_time = c(6.5, 12, 12.5, 36, 16.5, 11.5, 16.5, 
15.5, 15, 36, 13.5, 13.5, 8, 10, 12, 19, 16, 36, 8, 12, 36, 18.5, 
36, 7, 7, 13, 12, 19.5, 36, 7.5, 17.5, 17.5, 17.5, 17, 19, 14, 
22, 17, 22, 8.5, 11, 10, 19, 19, 12.5, 16.5, 36, 36, 13, 7, 8, 
36, 12.5, 22.5, 12.5, 11, 20, 15.5, 24, 10.5, 18, 12.5, 17, 14.5, 
17.5, 11.5, 17, 19, 12, 9.5, 9, 17, 17.5, 23.5, 22.5, 16.5, 18.5, 
19, 17, 16, 11, 8, 20.5, 19.5, 12.5, 11.5, 11, 19.5, 20.5, 17.5, 
9.5, 16, 18.5, 13, 16, 20.5, 22.5, 14, 10.5, 10)), row.names = c(NA, 
100L), class = "data.frame")

Upvotes: 1

Views: 374

Answers (2)

tpetzoldt
tpetzoldt

Reputation: 5838

I hope I understand it now better, here comes an approach based on an outer product and data mangling with package tidyr. It is more memory consuming than a loop, but also more compact:


library("tidyr")
library("dplyr")

## half hour time slots
tm <- seq(0, 24, 0.5)

## Test if a value is in the interval. Please check manually with some examples.
## second version is more robust against IEEE floating point deviations
# fun <- function(i, x) (d[i, "new_arr_time"] <= x) & (x <= d[i, "dep_time"])
fun <- function(i, x) (d[i, "new_arr_time"] - x < 1e-6) & (x - d[i, "dep_time"] < 1e-6)

## outer creates all combinations between LinkIDs and time slots
expanded <- data.frame(outer(1:nrow(d), tm, fun))
names(expanded) <- tm

cbind(linkId=d$linkId, expanded) %>%
  pivot_longer(-linkId, names_to = "time", values_to = "count") %>%
  group_by(linkId, time) %>%
  summarize(count = sum(count))

To understand what the individual steps in the %>%pipeline do, rebuild the pipe from scratch and add one line after each other.

Upvotes: 2

tpetzoldt
tpetzoldt

Reputation: 5838

I don't understand what you mean with:

have their new_arr_time <= 12 , and dep_time >=12

but if we assume, just for example, that it is enough to use duration (or another calculated value), one can do such an aggregation without any loop with aggregate in R base or with package dplyr:

d %>% group_by(linkId, duration) %>% summarize(count = n())

or with "base R" (without dplyr):

with(d, aggregate(list(count = linkId), list(linkId = linkId, duration=duration), length))

Upvotes: 0

Related Questions