Serban Dragne

Reputation: 81

Counting rows based upon conditional grouping with dplyr

I have a dataframe as follows:

        position_time telematic_trip_no  lat_dec lon_dec
1 2016-06-05 00:00:01         526132109 -26.6641 27.8733
2 2016-06-05 00:00:01         526028387 -26.6402 27.8059
3 2016-06-05 00:00:01         526081476 -26.5545 28.3263
4 2016-06-05 00:00:04         526140512 -26.5310 27.8704
5 2016-06-05 00:00:05         526140518 -26.5310 27.8704
6 2016-06-05 00:00:19         526006880 -26.5010 27.8490 
  is_stolen hour_of_day time_of_day day_of_week  lat_min
1         0           0           0      Sunday -26.6651
2         0           0           0      Sunday -26.6412
3         0           0           0      Sunday -26.5555
4         0           0           0      Sunday -26.5320
5         0           0           0      Sunday -26.5320
6         0           0           0      Sunday -26.5020
   lat_max lon_max lon_min 
1 -26.6631 27.8743 27.8723     
2 -26.6392 27.8069 27.8049    
3 -26.5535 28.3273 28.3253    
4 -26.5300 27.8714 27.8694      
5 -26.5300 27.8714 27.8694      
6 -26.5000 27.8500 27.8480     

Now what I want to do is count, for each line where is_stolen = 1, the number of rows in the dataframe that fulfill the following conditions:

  • a different telematic_trip_no
  • the same time_of_day and day_of_week
  • a lat_min to lat_max window that contains the stolen row's lat_dec
  • a lon_min to lon_max window that contains the stolen row's lon_dec

I've written a script to do this using a for loop, but it ran very slowly, which got me wondering: is there an efficient way to do complex row counts with many conditions using something like dplyr or data.table?

P.S. If you're curious, I am indeed trying to calculate how many cars a stolen car passes during a typical trip :)

Upvotes: 1

Views: 1322

Answers (2)

Arun

Reputation: 118889

The current development version of data.table, v1.9.7, has a new feature, non-equi joins, which makes conditional joins quite straightforward. Using @aichao's data:

require(data.table) # v1.9.7+
setDT(df)[, ID := .I] # add row numbers
not_stolen = df[is_stolen == 0L]
is_stolen  = df[is_stolen == 1L]

# for each stolen row, count matching not-stolen rows,
# excluding rows that share its telematic_trip_no
not_stolen[is_stolen,
    .(ID = i.ID, N = .N - sum(telematic_trip_no == i.telematic_trip_no)),
    on = .(time_of_day, day_of_week, lat_min <= lat_dec,
           lat_max >= lat_dec, lon_min <= lon_dec, lon_max >= lon_dec),
    by = .EACHI][, .(ID, N)]
#    ID  N
# 1:  7 NA
# 2:  8 NA
# 3:  9  0
# 4: 10  2
# 5: 11  1
# 6: 12 NA
# 7: 13 NA

The part not_stolen[is_stolen, performs a subset-like join operation, i.e., for each row in is_stolen, the matching row indices (based on the condition provided to the on= argument) are extracted.

by = .EACHI ensures that, for each row of the i (first) argument, here is_stolen, the expression provided in j, the second argument, .(ID = i.ID, N = .N - sum(telematic_trip_no == i.telematic_trip_no)), is evaluated on the corresponding matching row indices. That returns the result shown above. Note that stolen rows with no matching not-stolen rows at all get N = NA; those can be read as a count of zero.
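For intuition, the same per-stolen-row count can be written in slow but transparent base R. The toy frame below is hypothetical (values invented for illustration); it only reuses the question's column names:

```r
# For each stolen row, count not-stolen rows whose lat/lon window contains
# the stolen row's position, with matching time_of_day and day_of_week
# and a different telematic_trip_no.
count_passed <- function(df) {
  st <- df[df$is_stolen == 1, ]  # stolen rows
  ns <- df[df$is_stolen == 0, ]  # not-stolen rows
  sapply(seq_len(nrow(st)), function(i) {
    s <- st[i, ]
    sum(ns$telematic_trip_no != s$telematic_trip_no &
        ns$time_of_day == s$time_of_day &
        ns$day_of_week == s$day_of_week &
        s$lat_dec >= ns$lat_min & s$lat_dec <= ns$lat_max &
        s$lon_dec >= ns$lon_min & s$lon_dec <= ns$lon_max)
  })
}

# Hypothetical toy data: two not-stolen trips and one stolen trip
toy <- data.frame(
  telematic_trip_no = c(1L, 2L, 9L),
  lat_dec = c(-26.640, -26.531, -26.531),
  lon_dec = c( 27.806,  27.870,  27.870),
  is_stolen   = c(0L, 0L, 1L),
  time_of_day = c(0L, 0L, 0L),
  day_of_week = c("Sunday", "Sunday", "Sunday"),
  lat_min = c(-26.641, -26.532, -26.532),
  lat_max = c(-26.639, -26.530, -26.530),
  lon_min = c( 27.805,  27.869,  27.869),
  lon_max = c( 27.807,  27.871,  27.871)
)
count_passed(toy)  # stolen trip 9 falls inside trip 2's window only -> 1
```

This is exactly the count the non-equi join computes, just without the index-based speedup.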

HTH.

Upvotes: 1

aichao

Reputation: 7455

Given your description of the problem, the following should work:

library(dplyr)
library(stats)
# df is the data.frame (see below)
df <- cbind(ID=seq_len(nrow(df)),df)
r.stolen <- which(df$is_stolen == 1)
r.not <- which(df$is_stolen != 1)
print(df[rep(r.not, times = length(r.stolen)), ] %>%
  setNames(., paste0(names(.), "_not")) %>%
  bind_cols(df[rep(r.stolen, each = length(r.not)), ], .) %>%
  mutate(in_range = as.numeric(telematic_trip_no != telematic_trip_no_not &
                               time_of_day == time_of_day_not &
                               day_of_week == day_of_week_not &
                               lat_dec >= lat_min_not & lat_dec <= lat_max_not &
                               lon_dec >= lon_min_not & lon_dec <= lon_max_not)) %>%
  group_by(ID) %>%
  summarise(count = sum(in_range)) %>%
  arrange(desc(count)))

The first line just adds a column named ID to df that identifies each row by its row number; we can later group on it (dplyr::group_by) to make the count.

The next two lines divide the rows into stolen and not-stolen cars. The key is to:

  1. replicate each row of stolen cars N times where N is the number of not-stolen car rows,
  2. replicate the rows of not-stolen cars (as a block) M times where M is the number of stolen car rows, and
  3. append the result of (2) to (1) as new columns and change the names of these new columns so that we can reference them in the condition

The result of (3) has rows that enumerate all pairs of stolen and not-stolen rows from the original data frame, so that your condition can be applied in an array fashion. The dplyr piped workflow that is the fourth line of the code (wrapped in a print()) does this:

  • the first command replicates the not-stolen car rows using times
  • the second command appends _not to the column names to distinguish them from the stolen car columns when we bind the columns. Thanks to this SO answer for that gem.
  • the third command replicates the stolen car rows using each and appends the previous result as new columns using dplyr::bind_cols
  • the fourth command uses dplyr::mutate to create a new column named in_range that is the result of applying the condition. The boolean result is converted to {0,1} to allow for easy accumulation
  • the rest of the commands in the pipe do the counting of in_range grouped by ID, arranging the results in decreasing order of the count. Note that ID now identifies the rows of the original data frame for which is_stolen = 1, whereas ID_not identifies the rows for which is_stolen = 0
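The times= versus each= replication used in steps (1) and (2) can be seen on a tiny example (the row numbers here are hypothetical):

```r
stolen_rows <- c(7L, 8L)      # hypothetical stolen row numbers (M = 2)
not_rows    <- c(1L, 2L, 3L)  # hypothetical not-stolen row numbers (N = 3)

# each stolen row repeated N times
rep(stolen_rows, each = length(not_rows))    # 7 7 7 8 8 8
# the not-stolen block repeated M times
rep(not_rows, times = length(stolen_rows))   # 1 2 3 1 2 3
```

Read side by side, the two vectors enumerate all M*N stolen/not-stolen pairs, which is why the condition can then be evaluated in one vectorized pass.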

This assumes that you want the count for each row where is_stolen = 1 in the original data frame, which is what you said in your question. If instead you really want the count for each telematic_trip_no that is stolen, then you can use

group_by(telematic_trip_no) %>%

in the pipe instead.

I've tested this using the following data snippet

df <- structure(list(position_time = structure(c(1L, 1L, 1L, 2L, 3L, 
                4L, 4L, 5L, 6L, 7L, 8L, 9L, 10L), .Label = c("2016-06-05 00:00:01", 
                "2016-06-05 00:00:04", "2016-06-05 00:00:05", "2016-06-05 00:00:19", 
                "2016-06-05 00:00:20", "2016-06-05 00:00:22", "2016-06-05 00:00:23", 
                "2016-06-05 00:00:35", "2016-06-05 00:09:34", "2016-06-06 01:00:06"
                ), class = "factor"), telematic_trip_no = c(526132109L, 526028387L, 
                526081476L, 526140512L, 526140518L, 526006880L, 526017880L, 526027880L, 
                526006880L, 526006890L, 526106880L, 526005880L, 526007880L), 
                lat_dec = c(-26.6641, -26.6402, -26.5545, -26.531, -26.531, 
                -26.501, -26.5315, -26.5325, -26.501, -26.5315, -26.5007, 
                -26.5315, -26.5315), lon_dec = c(27.8733, 27.8059, 28.3263, 
                27.8704, 27.8704, 27.849, 27.88, 27.87, 27.849, 27.87, 27.8493, 
                27.87, 27.87), is_stolen = c(0L, 0L, 0L, 0L, 0L, 0L, 1L, 
                1L, 1L, 1L, 1L, 1L, 1L), hour_of_day = c(0L, 0L, 0L, 0L, 
                0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), time_of_day = c(0L, 
                0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 9L, 0L), day_of_week = structure(c(2L, 
                2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L), .Label = c("Monday", 
                "Sunday"), class = "factor"), lat_min = c(-26.6651, -26.6412, 
                -26.5555, -26.532, -26.532, -26.502, -26.532, -26.532, -26.502, 
                -26.532, -26.502, -26.532, -26.532), lat_max = c(-26.6631, 
                -26.6392, -26.5535, -26.53, -26.53, -26.5, -26.53, -26.53, 
                -26.5, -26.53, -26.5, -26.53, -26.53), lon_max = c(27.8743, 
                27.8069, 28.3273, 27.8714, 27.8714, 27.85, 27.8714, 27.8714, 
                27.85, 27.8714, 27.85, 27.8714, 27.8714), lon_min = c(27.8723, 
                27.8049, 28.3253, 27.8694, 27.8694, 27.848, 27.8694, 27.8694, 
                27.848, 27.8694, 27.848, 27.8694, 27.8694)), .Names = c("position_time", 
                "telematic_trip_no", "lat_dec", "lon_dec", "is_stolen", "hour_of_day", 
                "time_of_day", "day_of_week", "lat_min", "lat_max", "lon_max", 
                "lon_min"), class = "data.frame", row.names = c(NA, -13L))

Here, I appended 7 new rows with is_stolen = 1 to your original 6 rows, which all have is_stolen = 0:

  1. the added row with telematic_trip_no = 526005880 violates the time_of_day condition (its time_of_day is 9) for all not-stolen rows, so its count should be 0
  2. the added row with telematic_trip_no = 526006880 shares its telematic_trip_no with not-stolen row 6, the only row it otherwise matches, so its count should be 0
  3. the added row with telematic_trip_no = 526007880 violates the day_of_week condition (Monday) for all not-stolen rows, so its count should be 0
  4. the added row with telematic_trip_no = 526006890 satisfies the condition for not-stolen rows 4 and 5, so its count should be 2
  5. the added row with telematic_trip_no = 526106880 satisfies the condition for not-stolen row 6, so its count should be 1
  6. the added row with telematic_trip_no = 526017880 violates the longitude condition for all not-stolen rows, so its count should be 0
  7. the added row with telematic_trip_no = 526027880 violates the latitude condition for all not-stolen rows, so its count should be 0

Running the code on this data gives:

# A tibble: 7 x 2
     ID count
  <int> <dbl>
1    10     2
2    11     1
3     7     0
4     8     0
5     9     0
6    12     0
7    13     0

which is as expected, recalling that the appended rows with is_stolen = 1 start at row 7 with ID = 7.

Grouping by telematic_trip_no instead gives the result:

# A tibble: 7 x 2
  telematic_trip_no count
              <int> <dbl>
1         526006890     2
2         526106880     1
3         526005880     0
4         526006880     0
5         526007880     0
6         526017880     0
7         526027880     0

As a caveat, the above approach does cost memory. In the worst case, the number of rows grows to N^2/4, where N is the number of rows in the original data frame, and the number of columns doubles for the data frame that is used to evaluate the condition. As with most array-processing techniques, there is a trade-off between speed and memory.
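To make the growth concrete, here is the row count of the paired frame for the 13-row test data above (7 stolen, 6 not-stolen), and for the worst case of an even split:

```r
# rows in the paired frame = (stolen rows) x (not-stolen rows)
pairs_rows <- function(n_stolen, n_not) n_stolen * n_not

pairs_rows(7, 6)    # the 13-row example pairs out to 42 rows
pairs_rows(50, 50)  # an even 100-row split gives 2500 = 100^2 / 4 rows
```

The product is maximized when the stolen and not-stolen groups are the same size, which is where the N^2/4 worst case comes from.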

Hope this helps.

Upvotes: 1
