Reputation: 81
I have a dataframe as follows:
position_time telematic_trip_no lat_dec lon_dec
1 2016-06-05 00:00:01 526132109 -26.6641 27.8733
2 2016-06-05 00:00:01 526028387 -26.6402 27.8059
3 2016-06-05 00:00:01 526081476 -26.5545 28.3263
4 2016-06-05 00:00:04 526140512 -26.5310 27.8704
5 2016-06-05 00:00:05 526140518 -26.5310 27.8704
6 2016-06-05 00:00:19 526006880 -26.5010 27.8490
is_stolen hour_of_day time_of_day day_of_week lat_min
1 0 0 0 Sunday -26.6651
2 0 0 0 Sunday -26.6412
3 0 0 0 Sunday -26.5555
4 0 0 0 Sunday -26.5320
5 0 0 0 Sunday -26.5320
6 0 0 0 Sunday -26.5020
lat_max lon_max lon_min
1 -26.6631 27.8743 27.8723
2 -26.6392 27.8069 27.8049
3 -26.5535 28.3273 28.3253
4 -26.5300 27.8714 27.8694
5 -26.5300 27.8714 27.8694
6 -26.5000 27.8500 27.8480
Now what I want to do is count, for each line where is_stolen = 1, the number of rows in the dataframe that fulfill the following conditions: the other row has a different telematic_trip_no, the same time_of_day and day_of_week, and its bounding box (lat_min to lat_max, lon_min to lon_max) contains the stolen row's lat_dec and lon_dec.
I've written a script to do this using a for loop, but it ran very slowly, which got me wondering whether there's an efficient way to do complex row counts with many conditions using something like dplyr or data.table?
ps If you're curious, I am indeed trying to calculate how many cars a stolen car passes during a typical trip :)
Upvotes: 1
Views: 1322
Reputation: 118889
The current development version of data.table, v1.9.7, has a new feature, non-equi joins, which makes conditional joins quite straightforward. Using @aichao's data:
require(data.table) # v1.9.7+
setDT(df)[, ID := .I] # add row numbers
not_stolen = df[is_stolen == 0L]
is_stolen = df[is_stolen == 1L]
not_stolen[is_stolen,
           .(ID = i.ID, N = .N - sum(telematic_trip_no == i.telematic_trip_no)),
           on = .(time_of_day, day_of_week, lat_min <= lat_dec,
                  lat_max >= lat_dec, lon_min <= lon_dec, lon_max >= lon_dec),
           by = .EACHI][, .(ID, N)]
# ID N
# 1: 7 NA
# 2: 8 NA
# 3: 9 0
# 4: 10 2
# 5: 11 1
# 6: 12 NA
# 7: 13 NA
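Note that stolen rows with no matching not-stolen rows at all come back with N = NA rather than 0 (IDs 7, 8, 12 and 13 above). If you would rather count those as zero, one way is a small in-place update on the result; a sketch, with the printed values above copied into a stand-in table:

```r
library(data.table)

# res stands in for the join output above (same ID/N columns);
# values copied from the printed result
res <- data.table(ID = c(7L, 8L, 9L, 10L, 11L, 12L, 13L),
                  N  = c(NA, NA, 0L, 2L, 1L, NA, NA))

# stolen rows with no matching not-stolen row come back as NA;
# treat "no match" as a count of zero
res[is.na(N), N := 0L]
```

After the update, res$N is 0 0 0 2 1 0 0, matching the counts the other answer derives by hand.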
The part not_stolen[is_stolen, ...] performs a subset-like join operation: for each row in is_stolen, the matching row indices in not_stolen (based on the condition provided to the on= argument) are extracted.
by = .EACHI ensures that, for each row of the i (first) argument, here is_stolen, the expression provided in j (the second argument), .(ID = i.ID, N = .N - sum(telematic_trip_no == i.telematic_trip_no)), is evaluated on the corresponding matching row indices. That returns the result shown above.
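To see the mechanics in isolation, here is a minimal sketch of the same non-equi join plus by = .EACHI pattern on made-up toy data: for each row of q, count how many rows of x fall inside its [lo, hi] range.

```r
library(data.table)  # needs v1.9.7 (dev) or later for non-equi joins

x <- data.table(val = c(1, 5, 9))                        # rows to be counted
q <- data.table(qid = 1:2, lo = c(0, 4), hi = c(2, 10))  # query ranges

# for each row of q (by = .EACHI), count the matching rows of x,
# then drop the join columns that .EACHI prepends
res <- x[q, .(qid = i.qid, N = .N),
         on = .(val >= lo, val <= hi),
         by = .EACHI][, .(qid, N)]
res
#    qid N
# 1:   1 1
# 2:   2 2
```

The trailing [, .(qid, N)] plays the same role as [, .(ID, N)] above: by = .EACHI prepends the join columns to the result, and this drops them.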
HTH.
Upvotes: 1
Reputation: 7455
Given your description of the problem, the following should work:
library(dplyr)
library(stats)
# df is the data.frame (see below)
df <- cbind(ID=seq_len(nrow(df)),df)
r.stolen <- which(df$is_stolen == 1)
r.not <- which(df$is_stolen != 1)
print(df[rep(r.not, times=length(r.stolen)),] %>%
        setNames(., paste0(names(.), "_not")) %>%
        bind_cols(df[rep(r.stolen, each=length(r.not)),], .) %>%
        mutate(in_range = as.numeric(telematic_trip_no != telematic_trip_no_not &
                                     time_of_day == time_of_day_not &
                                     day_of_week == day_of_week_not &
                                     lat_dec >= lat_min_not & lat_dec <= lat_max_not &
                                     lon_dec >= lon_min_not & lon_dec <= lon_max_not)) %>%
        group_by(ID) %>%
        summarise(count = sum(in_range)) %>%
        arrange(desc(count)))
The first line just adds a column named ID to df that identifies each row by its row number, which we can later group by (dplyr::group_by) to make the count.
The next two lines divide the rows into stolen and not-stolen cars. The key idea is to:
1. replicate the stolen-car rows N times, where N is the number of not-stolen car rows,
2. replicate the not-stolen car rows M times, where M is the number of stolen car rows, and
3. column-bind the results of (1) and (2).
The result of (3) has rows that enumerate all pairs of stolen and not-stolen rows from the original data frame, so that your condition can be applied in an array fashion. The dplyr-piped workflow that is the fourth line of the code (wrapped in a print()) does this:
1. replicates the not-stolen rows using rep with times and appends _not to the column names to distinguish them from the stolen-car columns when we bind the columns (thanks to this SO answer for that gem),
2. replicates the stolen rows using rep with each and appends the previous result as new columns using dplyr::bind_cols,
3. uses dplyr::mutate to create a new column named in_range that is the result of applying the condition; the boolean result is converted to {0,1} to allow for easy accumulation, and
4. sums in_range grouped by ID and arranges the results in decreasing order of the count.
Note that now ID is the column that identifies the rows of the original data frame for which is_stolen = 1, whereas ID_not is the column for rows for which is_stolen = 0.
This assumes that you want the count for each row for which is_stolen = 1 in the original data frame, which is what you said in your question. If instead you really want the count for each telematic_trip_no that is stolen, then you can use group_by(telematic_trip_no) %>% in the pipe instead.
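The replication trick in steps (1) and (2) can be seen on a tiny made-up frame: rep(..., each =) and rep(..., times =) together enumerate every stolen/not-stolen pair before bind_cols glues them side by side.

```r
library(dplyr)

stolen     <- data.frame(ID = c(7, 8))        # toy stand-ins for stolen rows
not_stolen <- data.frame(ID_not = c(1, 2, 3)) # ... and not-stolen rows

# each stolen row repeated 3 times, the not-stolen block repeated 2 times,
# giving all 2 x 3 = 6 pairs when bound column-wise
pairs <- bind_cols(
  stolen[rep(seq_len(nrow(stolen)), each = nrow(not_stolen)), , drop = FALSE],
  not_stolen[rep(seq_len(nrow(not_stolen)), times = nrow(stolen)), , drop = FALSE]
)
pairs$ID      # 7 7 7 8 8 8
pairs$ID_not  # 1 2 3 1 2 3
```

Because the two replicated frames line up row-for-row, the condition in mutate() is evaluated over every pair in one vectorised pass.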
I've tested this using the following data snippet
df <- structure(list(position_time = structure(c(1L, 1L, 1L, 2L, 3L,
4L, 4L, 5L, 6L, 7L, 8L, 9L, 10L), .Label = c("2016-06-05 00:00:01",
"2016-06-05 00:00:04", "2016-06-05 00:00:05", "2016-06-05 00:00:19",
"2016-06-05 00:00:20", "2016-06-05 00:00:22", "2016-06-05 00:00:23",
"2016-06-05 00:00:35", "2016-06-05 00:09:34", "2016-06-06 01:00:06"
), class = "factor"), telematic_trip_no = c(526132109L, 526028387L,
526081476L, 526140512L, 526140518L, 526006880L, 526017880L, 526027880L,
526006880L, 526006890L, 526106880L, 526005880L, 526007880L),
lat_dec = c(-26.6641, -26.6402, -26.5545, -26.531, -26.531,
-26.501, -26.5315, -26.5325, -26.501, -26.5315, -26.5007,
-26.5315, -26.5315), lon_dec = c(27.8733, 27.8059, 28.3263,
27.8704, 27.8704, 27.849, 27.88, 27.87, 27.849, 27.87, 27.8493,
27.87, 27.87), is_stolen = c(0L, 0L, 0L, 0L, 0L, 0L, 1L,
1L, 1L, 1L, 1L, 1L, 1L), hour_of_day = c(0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), time_of_day = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 9L, 0L), day_of_week = structure(c(2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L), .Label = c("Monday",
"Sunday"), class = "factor"), lat_min = c(-26.6651, -26.6412,
-26.5555, -26.532, -26.532, -26.502, -26.532, -26.532, -26.502,
-26.532, -26.502, -26.532, -26.532), lat_max = c(-26.6631,
-26.6392, -26.5535, -26.53, -26.53, -26.5, -26.53, -26.53,
-26.5, -26.53, -26.5, -26.53, -26.53), lon_max = c(27.8743,
27.8069, 28.3273, 27.8714, 27.8714, 27.85, 27.8714, 27.8714,
27.85, 27.8714, 27.85, 27.8714, 27.8714), lon_min = c(27.8723,
27.8049, 28.3253, 27.8694, 27.8694, 27.848, 27.8694, 27.8694,
27.848, 27.8694, 27.848, 27.8694, 27.8694)), .Names = c("position_time",
"telematic_trip_no", "lat_dec", "lon_dec", "is_stolen", "hour_of_day",
"time_of_day", "day_of_week", "lat_min", "lat_max", "lon_max",
"lon_min"), class = "data.frame", row.names = c(NA, -13L))
Here, I appended 7 new rows with is_stolen = 1 to your original 6 rows that are all is_stolen = 0:
- telematic_trip_no = 526005880 violates the longitude condition for all not-stolen rows, so its count should be 0
- telematic_trip_no = 526006880 violates the latitude condition for all not-stolen rows, so its count should be 0
- telematic_trip_no = 526007880 violates the telematic_trip_no condition for all not-stolen rows, so its count should be 0
- telematic_trip_no = 526006890 satisfies the condition for rows 4 and 5 that are not-stolen, so its count should be 2
- telematic_trip_no = 526106880 satisfies the condition for row 6 that is not-stolen, so its count should be 1
- telematic_trip_no = 526017880 violates the time_of_day condition for all not-stolen rows, so its count should be 0
- telematic_trip_no = 526027880 violates the day_of_week condition for all not-stolen rows, so its count should be 0
Running the code on this data gives:
# A tibble: 7 x 2
ID count
<int> <dbl>
1 10 2
2 11 1
3 7 0
4 8 0
5 9 0
6 12 0
7 13 0
which is as expected, recalling that the appended rows with is_stolen = 1 start at row 7 with ID = 7.
If one groups by telematic_trip_no instead, the result is:
# A tibble: 7 x 2
telematic_trip_no count
<int> <dbl>
1 526006890 2
2 526106880 1
3 526005880 0
4 526006880 0
5 526007880 0
6 526017880 0
7 526027880 0
As a caveat, the above approach does cost memory. In the worst case, the number of rows grows to N^2/4, where N is the number of rows in the original data frame, and the number of columns doubles in the data frame used to evaluate the condition. As with most array-processing techniques, there is a trade-off between speed and memory.
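As a cheap safeguard, you can compute the size of the expanded pair table before materialising it and decide whether the cross-pairing is feasible; a sketch, using the counts from the 13-row test frame above:

```r
# counts from the 13-row test frame above (7 stolen, 6 not stolen)
n_stolen <- 7L
n_not    <- 6L
n_pairs  <- n_stolen * n_not   # rows in the expanded frame: 42
n_total  <- n_stolen + n_not

# the product n_stolen * n_not is maximised when the split is even,
# which gives the worst-case bound n_total^2 / 4 quoted above
stopifnot(n_pairs <= n_total^2 / 4)
```

If n_pairs (times roughly twice the per-row memory of the original frame) is too large for your machine, the data.table non-equi join in the other answer avoids building the full cross-pairing.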
Hope this helps.
Upvotes: 1