Emily

Reputation: 899

How to apply a function to increasing subsets of data in a data frame

I wish to apply a set of pre-written functions to subsets of a data frame that progressively increase in size. In this example, the pre-written functions calculate 1) the distance between each consecutive pair of locations in a series of data points, 2) the total distance of the series (the sum of step 1), 3) the straight-line distance between the start and end locations of the series, and 4) the ratio of the straight-line distance (step 3) to the total distance (step 2). I would like to know how to apply these steps (and, by extension, similar functions) to sub-groups of increasing size within a data frame. Below are some example data and the pre-written functions.

Example data:

> dput(df)
structure(list(latitude = c(52.640715, 52.940366, 53.267749, 
53.512608, 53.53215, 53.536443), longitude = c(3.305727, 3.103194, 
2.973257, 2.966621, 3.013587, 3.002674)), .Names = c("latitude", 
"longitude"), class = "data.frame", row.names = c(NA, -6L))

  Latitude Longitude
1 52.64072  3.305727
2 52.94037  3.103194
3 53.26775  2.973257
4 53.51261  2.966621
5 53.53215  3.013587
6 53.53644  3.002674

Pre-written functions:

# Step 1: Distance between each consecutive pair of locations
pairdist = sapply(2:nrow(df), function(x) {
  with(df, trackDistance(longitude[x-1], latitude[x-1],
                         longitude[x], latitude[x], longlat=TRUE))
})
# Step 2: Total distance over all locations
totdist = sum(pairdist)
# Step 3: Straight-line distance between the first and last location
straight = trackDistance(df[1,2], df[1,1], df[nrow(df),2], df[nrow(df),1], longlat=TRUE)
# Step 4: Ratio of the straight-line distance to the total distance
distrat = straight/totdist

I would like to apply the functions first to a sub-group containing only the first two rows (rows 1-2), then to a sub-group of the first three rows (rows 1-3), then the first four rows, and so on until I reach the end of the data frame (in the example this would be a sub-group containing rows 1-6, but it would be nice to know how to apply this to any data frame).

Desired output:

Subgroup  Totdist   Straight    Ratio
1         36.017     36.017     1.000                  
2         73.455     73.230     0.997
3        100.694     99.600     0.989
4        104.492    101.060     0.967
5        105.360    101.672     0.965

I have attempted to do this with no success and at the moment this is beyond my ability. Any advice would be very much appreciated!

Upvotes: 3

Views: 166

Answers (1)

Joris Meys

Reputation: 108543

There's a lot of optimization that can be done.

  • trackDistance() is vectorized, so you don't need sapply() for that.
  • To get the total distance for every subset in a vectorized way, use cumsum().
  • You only need to calculate the pairwise distances once. Recalculating them every time you look at a different subset is a waste of resources, so try to think in terms of the complete data frame when constructing your functions.
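To illustrate the cumsum() point, here is a minimal sketch in base R. The step lengths in pairdist are made-up numbers standing in for the output of trackDistance(), so nothing here depends on the real data or on any package.

```r
# Hypothetical step lengths (km) between consecutive points; made-up values
pairdist <- c(36.0, 37.4, 27.2, 3.8, 0.9)

# Subset-by-subset approach: re-sum the steps for rows 1-2, 1-3, ...
slow <- sapply(seq_along(pairdist), function(i) sum(pairdist[1:i]))

# Vectorized approach: a single pass gives the same running totals
fast <- cumsum(pairdist)

# Both approaches agree
all.equal(slow, fast)
```

Element i of fast is the total distance for the sub-group made of the first i + 1 rows, which is why the pairwise distances only need to be computed once.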

To get everything in one function that outputs the desired data frame, you can do something along these lines:

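myFun <- function(x){
  # Pull out the coordinate columns (names are case-sensitive)
  # to make typing easier in the rest of the function
  lat <- x[["Latitude"]]
  lon <- x[["Longitude"]]
  nr <- nrow(x)

  # Distance between each consecutive pair of points, in one vectorized call
  pairdist <- trackDistance(lon[-nr], lat[-nr],
                            lon[-1], lat[-1],
                            longlat=TRUE)

  # Running total distance for rows 1-2, 1-3, ..., 1-nr
  totdist <- cumsum(pairdist)

  # Straight-line distance from the first point to every later point
  straight <- trackDistance(rep(lon[1], nr-1),
                            rep(lat[1], nr-1),
                            lon[-1], lat[-1],
                            longlat=TRUE)

  ratio <- straight/totdist
  data.frame(totdist, straight, ratio)
}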

Proof of concept:

> myFun(df)
    totdist  straight     ratio
1  36.01777  36.01777 1.0000000
2  73.45542  73.22986 0.9969293
3 100.69421  99.60013 0.9891346
4 104.49261 101.06023 0.9671519
5 105.35956 101.67203 0.9650005

Note that you can add extra arguments to define the latitude and longitude columns. Also watch your capitalization: the data frame printed in your question uses Latitude, but the dput() output and your code use latitude (lowercase l).
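As a rough sketch of the extra-arguments idea, the variant below takes the column names as parameters. The names myFun2, latcol, loncol and distfun are hypothetical, and haversine_km() is only a stand-in spherical distance so the example runs without extra packages; its results will be close to, but not identical with, those of trackDistance().

```r
# Example data from the question (lowercase column names, as in dput())
df <- data.frame(
  latitude  = c(52.640715, 52.940366, 53.267749, 53.512608, 53.53215, 53.536443),
  longitude = c(3.305727, 3.103194, 2.973257, 2.966621, 3.013587, 3.002674)
)

# Stand-in great-circle distance in km (haversine on a sphere of radius r).
# This is an assumption for illustration, not what trackDistance() computes.
haversine_km <- function(lon1, lat1, lon2, lat2, r = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Hypothetical variant of myFun() with configurable column names
myFun2 <- function(x, latcol = "latitude", loncol = "longitude",
                   distfun = haversine_km) {
  lat <- x[[latcol]]
  lon <- x[[loncol]]
  nr <- nrow(x)

  # Consecutive step lengths, then their running total
  pairdist <- distfun(lon[-nr], lat[-nr], lon[-1], lat[-1])
  totdist <- cumsum(pairdist)

  # Distance from the first point to each later point
  straight <- distfun(rep(lon[1], nr - 1), rep(lat[1], nr - 1),
                      lon[-1], lat[-1])

  data.frame(totdist, straight, ratio = straight / totdist)
}

res <- myFun2(df)
res
```

To use the original distance function instead, something like distfun = function(...) trackDistance(..., longlat = TRUE) should work, assuming trackDistance() is available in your session. For a data frame with capitalized columns, call myFun2(df, latcol = "Latitude", loncol = "Longitude").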

Upvotes: 2
