elleewhy

Reputation: 21

R loop - read and aggregate from matrix - output to new matrix by two unique variables

I have a programming question in R that I haven't been able to find a solution to online. I am working with a data set that tracks the location of hospital providers over time: each record has a unique ID identifying the staff member, plus time stamps with associated locations. I have a distance matrix that describes how far certain locations (A through E) are from each other, and I want to aggregate those distances across the entire data set (calculating total distance walked).

My distance matrix is a 5x5 matrix describing movement from Section X to Section Y, where X and Y belong to the subset of Sections A through E. The following describes the distance matrix:

dmatrix=matrix(c(1, 2, 2, 3, 4, 2, 1, 2, 3, 4, 2, 2, 1, 2, 3, 3, 3, 2, 1, 2, 4, 4, 3, 2, 1), nrow=5, ncol=5)
colnames(dmatrix)<-c("sectionA", "sectionB", "sectionC", "sectionD", "sectionE")
rownames(dmatrix)<-c("sectionA", "sectionB", "sectionC", "sectionD", "sectionE")
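
For illustration, here is a minimal sketch of how entries can be read from this matrix, either one pair at a time or for several moves at once with a two-column character matrix. The `from` and `to` vectors are a made-up path, not part of my data.

```r
# distance matrix exactly as defined above
sec <- c("sectionA", "sectionB", "sectionC", "sectionD", "sectionE")
dmatrix <- matrix(c(1, 2, 2, 3, 4, 2, 1, 2, 3, 4, 2, 2, 1, 2, 3, 3, 3, 2, 1, 2, 4, 4, 3, 2, 1), nrow=5, ncol=5)
dimnames(dmatrix) <- list(sec, sec)

# single lookup by row and column name
d_ae <- dmatrix["sectionA", "sectionE"]   # 4

# several moves at once: each row of the index matrix is one (from, to) pair
from <- c("sectionA", "sectionC", "sectionE")   # hypothetical path
to   <- c("sectionC", "sectionE", "sectionB")
d_path <- dmatrix[cbind(from, to)]              # 2 3 4

# the matrix is symmetric, so the direction of a move does not matter
symmetric <- all(dmatrix == t(dmatrix))         # TRUE
```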

My data frame is a 5 million line data set that contains the locations where all staff members are by the second. The objective is to calculate the total distance walked on any given day (by reading from the distance matrix above) per staff ID.

I am able to successfully do this by sub-setting the data by date AND ID by using the following code:

jun10<-subset(dataframe,format(dataframe$st,'%m/%d')=='06/10')

jun1013<-jun10[jun10$id==13,]
# assign the sorted result back; ordering alone does not modify jun1013
jun1013<-jun1013[with(jun1013, order(st)),]
for(i in 1:nrow(jun1013))
{
jun1013$distance[i]=dmatrix[cbind(as.character(jun1013$section[i-1]),as.character(jun1013$section[i]))]
}
sum(jun1013$distance, na.rm=TRUE)

Note: "dataframe" is the name of the data frame, and dataframe$st holds the POSIXlt time stamps. Here I extract all entries from June 10th, and then all June 10th entries for ID number 13. This code gives me the total distance walked on June 10th by Staff #13.

Here is output showing head(dataframe):

   id                   room              start               stop duration               roomname       starttimelocal         endtimelocal durationseconds source resident attending1 attending2      unit     X_merge
1 104 ED-BCKNST (STAFF ROOM) 28feb2013 13:42:45 28feb2013 13:42:47        2 ED-BCKNST (STAFF ROOM) 2/28/2013 1:42:45 PM 2/28/2013 1:42:47 PM        00:00:02          1        0          0          0 EMERGENCY matched (3)
2 104 ED-BCKNST (STAFF ROOM) 28feb2013 13:37:46 28feb2013 13:37:51        5 ED-BCKNST (STAFF ROOM) 2/28/2013 1:37:46 PM 2/28/2013 1:37:51 PM        00:00:05      1        0          0          0 EMERGENCY matched (3)
3 104  ED-PELEV (STAFF ROOM) 14may2013 09:08:31 14may2013 09:08:35        4  ED-PELEV (STAFF ROOM) 5/14/2013 9:08:31 AM 5/14/2013 9:08:35 AM        00:00:04      2        0          0          0 EMERGENCY matched (3)
4 104 ED-BCKNST (STAFF ROOM) 28feb2013 09:34:34 28feb2013 09:34:38        4 ED-BCKNST (STAFF ROOM) 2/28/2013 9:34:34 AM 2/28/2013 9:34:38 AM        00:00:04      1        0          0          0 EMERGENCY matched (3)
5 104 ED-BCKNST (STAFF ROOM) 28feb2013 09:27:02 28feb2013 09:27:09        7 ED-BCKNST (STAFF ROOM) 2/28/2013 9:27:02 AM 2/28/2013 9:27:09 AM        00:00:07      1        0          0          0 EMERGENCY matched (3)
6 104 ED-BCKNST (STAFF ROOM) 26feb2013 19:07:56 26feb2013 19:08:01        5 ED-BCKNST (STAFF ROOM) 2/26/2013 7:07:56 PM 2/26/2013 7:08:01 PM        00:00:05      1        0          0          0 EMERGENCY matched (3)
                  st   categ  section   day
1 2013-02-28 01:42:45 staffrm sectionE 02/28
2 2013-02-28 01:37:46 staffrm sectionE 02/28
3 2013-05-14 09:08:31 staffrm sectionE 05/14
4 2013-02-28 09:34:34 staffrm sectionE 02/28
5 2013-02-28 09:27:02 staffrm sectionE 02/28
6 2013-02-26 07:07:56 staffrm sectionE 02/26

How do I do this for all of the data? I've tried writing loops with if conditions to do this by date and by ID, but I keep getting errors or NAs in the fields.

The perfect solution would go through the data, calculate the total distance walked as I've done above, and then output a matrix where each column is a unique day (June 10, 11, 12, etc.) and each row is a unique ID working on a given day. The entries in the matrix would be the sums describing total distance walked. Note that the number of staff working on each day differs.

Not sure if this is even possible, I've been stuck on this for over a week now - any help, insight, or advice would be tremendously helpful - thank you!!

Upvotes: 1

Views: 362

Answers (2)

bgoldst

Reputation: 35314

Input

To work on this problem I synthesized my own data. I used NS as the number of staff, ND as the total number of days in the period under examination, and NSD as the number of days worked by each staff member during the period under examination, assumed to be uniform (but that's not required for my solution). So, you can play with those numbers to synthesize different inputs, but for this demo, I use 3 staff, a 5 day period, and 3 days worked by each staff member within the overall 5 day period.

library('data.table');

## synthesize data
set.seed(1);
sec <- c('sectionA','sectionB','sectionC','sectionD','sectionE');
dmatrix <- matrix(c(0,2,2,3,4,2,0,2,3,4,2,2,0,2,3,3,3,2,0,2,4,4,3,2,0),5,dimnames=list(sec,sec));
NS <- 3; ND <- 5; NSD <- 3; loc <- data.table(id=rep(1:NS,each=NSD*8*60),st=as.POSIXlt('2015-06-10 09:00:00')+rep(replicate(NS,sort(sample(0:(ND-1),NSD))*86400),each=8*60)+seq(0,by=1,len=8*60)*60,section=do.call(c,replicate(NS*NSD,{ m <- 8L*60L; ls <- integer(); while (m > 0L) { ls[length(ls)+1L] <- as.integer(min(m,runif(1,10,100))); m <- m-ls[length(ls)]; }; rep(sample(sec,length(ls),replace=T),ls); },simplify=F)));
setkey(loc,id,st);
loc;
##       id                  st  section
##    1:  1 2015-06-11 09:00:00 sectionB
##    2:  1 2015-06-11 09:01:00 sectionB
##    3:  1 2015-06-11 09:02:00 sectionB
##    4:  1 2015-06-11 09:03:00 sectionB
##    5:  1 2015-06-11 09:04:00 sectionB
##   ---
## 4316:  3 2015-06-14 16:55:00 sectionE
## 4317:  3 2015-06-14 16:56:00 sectionE
## 4318:  3 2015-06-14 16:57:00 sectionE
## 4319:  3 2015-06-14 16:58:00 sectionE
## 4320:  3 2015-06-14 16:59:00 sectionE

Notes:

  • As you can see, I kept your dmatrix lookup table. However, I changed the numbers for same-section cells from 1 to 0 because there should be zero distance walked while a staff member remains in the same section, right? Please correct me if I'm missing something there.
  • I used the data.table package, which usually has a noticeable (and sometimes crucial) performance benefit.
  • I used a regular time sequence, but to keep the data more manageable, I used minute increments rather than seconds.
  • I assumed that all minutes worked are in the time range of 9am to 5pm, but that's not required for my solution.

Solution

If my understanding is correct, you want to sum up the total distance walked between sections by each staff member on each day the staff member worked during the time period under examination.

First, I wrote this little helper function that takes a vector of sections and returns a vector of distances, representing the distance from the previous section to the current section. (The lookup actually indexes dmatrix as [current, previous], which gives the same result here because the matrix is symmetric.) This always returns zero as the first element of the returned distance vector, because there's no previous section for the first section in the input vector.

## vectorized section distance helper function
getDist <- function(secvec) c(0,dmatrix[cbind(secvec[-1],secvec[-length(secvec)])]);
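
To make the helper concrete, here is a small worked example on a made-up path of four sections (the `path` vector is hypothetical). It is only a sketch, using the zero-diagonal dmatrix from the input section above.

```r
sec <- c('sectionA','sectionB','sectionC','sectionD','sectionE')
dmatrix <- matrix(c(0,2,2,3,4,2,0,2,3,4,2,2,0,2,3,3,3,2,0,2,4,4,3,2,0),5,dimnames=list(sec,sec))

# same helper as above: pairs each section with the one before it
getDist <- function(secvec) c(0,dmatrix[cbind(secvec[-1],secvec[-length(secvec)])])

# hypothetical path: A -> C -> E -> B
path <- c('sectionA','sectionC','sectionE','sectionB')
steps <- getDist(path)   # 0 2 3 4 (no move, A->C, C->E, E->B)
total <- sum(steps)      # 9
```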

Now, we can use a nice little one-liner based on data.table indexing syntax to achieve the requirement:

## calculate distance walked for each staff id and each day worked
loc[,.(dist=sum(getDist(rle(section)$values),na.rm=T)),.(id,day=as.Date(st))];
##    id        day dist
## 1:  1 2015-06-11   19
## 2:  1 2015-06-13   15
## 3:  1 2015-06-14   13
## 4:  2 2015-06-10   21
## 5:  2 2015-06-12   18
## 6:  2 2015-06-14   15
## 7:  3 2015-06-11   17
## 8:  3 2015-06-12   17
## 9:  3 2015-06-14   18
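
One thing that may be worth checking in the grouping expression day=as.Date(st): as.Date() on a POSIXct value takes a tz argument, and the calendar day depends on it. My synthetic 9am to 5pm data is unlikely to be affected in most time zones, but real data with late-night records could be. The sketch below passes tz explicitly both ways; the time zone name is just an assumption for illustration.

```r
# a late-evening time stamp in a hypothetical local time zone
st <- as.POSIXct("2013-06-10 23:30:00", tz = "America/New_York")

# interpreted in UTC, 23:30 local time already falls on the next calendar day
day_utc <- as.Date(st, tz = "UTC")                 # "2013-06-11"

# interpreted in the local zone, it stays on June 10
day_local <- as.Date(st, tz = "America/New_York")  # "2013-06-10"
```

If the day boundary should follow local time, the grouping could be written as day=as.Date(st, tz=...) with the zone your time stamps were recorded in.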

This groups by id and the day worked, which is calculated dynamically in the grouping argument as day=as.Date(st), and computes the sum of the distances walked. To get those distances, we don't need every adjacent pair of sections recorded in the location table, because usually the two will be identical and thus will not represent any distance walked. So I used the base R rle() function to collapse the section vector for the group into one entry per run of consecutive identical sections. rle() returns a list classed as 'rle' with two components, lengths and values. We don't care how long each run lasted, so I just pulled out the values, which is the section vector ready to pass to getDist().
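
As a small illustration of what rle() does to a section vector, here is a sketch on a made-up sequence of seven records. The as.character() call is there because rle() expects a plain atomic vector, so a factor column would need converting first.

```r
# hypothetical records for one staff member on one day
section <- c('sectionB','sectionB','sectionB','sectionD','sectionD','sectionB','sectionB')

# collapse consecutive repeats into runs
runs <- rle(as.character(section))
runs$values    # "sectionB" "sectionD" "sectionB"
runs$lengths   # 3 2 2

# only two actual moves remain: B -> D and D -> B
n_moves <- length(runs$values) - 1   # 2
```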

This is an extremely fast solution. For example, below I generate input data with 200 staff and 60 days worked during a 100 day examination period, resulting in 5.76 million records in the location table, and it completes in about a second:

set.seed(1);
NS <- 200; ND <- 100; NSD <- 60; loc <- data.table(id=rep(1:NS,each=NSD*8*60),st=as.POSIXlt('2015-06-10 09:00:00')+rep(replicate(NS,sort(sample(0:(ND-1),NSD))*86400),each=8*60)+seq(0,by=1,len=8*60)*60,section=do.call(c,replicate(NS*NSD,{ m <- 8L*60L; ls <- integer(); while (m > 0L) { ls[length(ls)+1L] <- as.integer(min(m,runif(1,10,100))); m <- m-ls[length(ls)]; }; rep(sample(sec,length(ls),replace=T),ls); },simplify=F)));
setkey(loc,id,st);
loc;
##           id                  st  section
##       1:   1 2015-06-10 09:00:00 sectionD
##       2:   1 2015-06-10 09:01:00 sectionD
##       3:   1 2015-06-10 09:02:00 sectionD
##       4:   1 2015-06-10 09:03:00 sectionD
##       5:   1 2015-06-10 09:04:00 sectionD
##      ---
## 5759996: 200 2015-09-14 16:55:00 sectionB
## 5759997: 200 2015-09-14 16:56:00 sectionB
## 5759998: 200 2015-09-14 16:57:00 sectionB
## 5759999: 200 2015-09-14 16:58:00 sectionB
## 5760000: 200 2015-09-14 16:59:00 sectionB
loc[,.(dist=sum(getDist(rle(section)$values),na.rm=T)),.(id,day=as.Date(st))];
##         id        day dist
##     1:   1 2015-06-10   16
##     2:   1 2015-06-11   21
##     3:   1 2015-06-13   23
##     4:   1 2015-06-14   19
##     5:   1 2015-06-15   20
##    ---
## 11996: 200 2015-09-09   22
## 11997: 200 2015-09-10   31
## 11998: 200 2015-09-11   21
## 11999: 200 2015-09-13   17
## 12000: 200 2015-09-14   17

Reshaping

Just noticed in your question you say the perfect solution would have the unique days as columns, and the staff ids as rows. You can achieve this with the base R reshape() function. For the below demo, I use the original simplified input (the one with 4320 rows), whose output I now assume has been assigned to locd:

reshape(locd[order(day)],dir='w',timevar='day')[order(id)];
##    id dist.2015-06-10 dist.2015-06-11 dist.2015-06-12 dist.2015-06-13 dist.2015-06-14
## 1:  1              NA              19              NA              15              13
## 2:  2              21              NA              18              NA              15
## 3:  3              NA              17              17              NA              18

Notice that the cells whose staff id (row) did not work that day (column) have NA as the distance, which is reasonable.
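
If you want an actual matrix rather than a wide table, base R tapply() is one alternative to reshape(). This is a sketch, not what I ran above: it rebuilds the nine result rows by hand as a plain data frame (values copied from the output earlier) and cross-tabulates them by id and day.

```r
# per-id, per-day totals copied from the small demo output above
locd <- data.frame(
  id   = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  day  = c('2015-06-11','2015-06-13','2015-06-14',
           '2015-06-10','2015-06-12','2015-06-14',
           '2015-06-11','2015-06-12','2015-06-14'),
  dist = c(19, 15, 13, 21, 18, 15, 17, 17, 18),
  stringsAsFactors = FALSE
)

# rows are staff ids, columns are days; combinations with no data become NA
m <- tapply(locd$dist, list(id = locd$id, day = locd$day), sum)
m
```

Within data.table, dcast(locd, id ~ day, value.var='dist') should give a similar wide layout.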

Upvotes: 1

Alexander Heath

Reputation: 116

To do this, I made a function distanceWalked, which calculates the distance traveled for each row except the first.

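distanceWalked <- function(data) {
  # the first row has no previous section, so its distance is 0
  data$distance[1] <- 0
  if (nrow(data) > 1) {
    for (i in 2:nrow(data)) {
      # as.character() keeps the lookup by name even if section is a factor
      data$distance[i] <- dmatrix[as.character(data$section[i-1]),as.character(data$section[i])]
    }
  }
  return(data)
}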
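
One assumption in a lookup like this is that section is a character vector. If it is a factor, R indexes the matrix by the factor's integer codes rather than its labels, which only matches when the factor levels happen to be in the same order as the matrix rows. A minimal sketch with a deliberately reordered, made-up factor:

```r
# distance matrix from the question
sec <- c("sectionA", "sectionB", "sectionC", "sectionD", "sectionE")
dmatrix <- matrix(c(1, 2, 2, 3, 4, 2, 1, 2, 3, 4, 2, 2, 1, 2, 3, 3, 3, 2, 1, 2, 4, 4, 3, 2, 1), nrow=5, ncol=5)
dimnames(dmatrix) <- list(sec, sec)

# hypothetical factor whose levels are not in matrix order
f <- factor(c("sectionE", "sectionA"), levels = c("sectionE", "sectionA"))

# indexed by integer codes 1 and 2, i.e. dmatrix["sectionA", "sectionB"]
by_code <- dmatrix[f[1], f[2]]                               # 2

# indexed by label, which is the intended E -> A distance
by_name <- dmatrix[as.character(f[1]), as.character(f[2])]   # 4
```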

I then made a new data table which has all unique combinations of day and id (dt here is your data as a data.table):

unique_combos <- unique(data.table(date = dt$day, id = dt$id))

Then I ran a for loop which subsets the data, orders each subset chronologically by the st column, runs distanceWalked on it, and appends the result to a new data table:

new_data <- data.table()
for (i in 1:nrow(unique_combos)) {
  dt_sub <- dt[dt$day == unique_combos$date[i] & dt$id == unique_combos$id[i]]
  setorder(dt_sub, st)
  dt_sub <- distanceWalked(dt_sub)
  new_data <- rbind(new_data, dt_sub)
}

I then used the dplyr package to find the sum of distance for each unique combination of day and id:

library(dplyr)

final_data <- new_data %>% group_by(day, id) %>% summarize(total_distance = sum(distance))

It should yield something like this:

    day  id total_distance
1 02/28 104              3
2 05/14 104              0
3 02/26 104              0

This might take a while to complete for 5 million rows, but it should get you where you need to go!
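
If the loop turns out to be too slow, the same idea can be sketched without any per-group subsetting: sort once, compare each row with the row before it, and only look up a distance when both rows belong to the same id and day. This is a base R sketch on made-up data, not the code I ran above. The toy rows, the UTC time zone and the zero-diagonal dmatrix are all assumptions for illustration.

```r
sec <- c("sectionA", "sectionB", "sectionC", "sectionD", "sectionE")
# assumed zero-diagonal variant of the distance matrix
dmatrix <- matrix(c(0,2,2,3,4,2,0,2,3,4,2,2,0,2,3,3,3,2,0,2,4,4,3,2,0), 5, dimnames = list(sec, sec))

# hypothetical records, deliberately out of order
t0 <- as.POSIXct("2013-06-10 09:00:00", tz = "UTC")
df <- data.frame(
  id      = c(13, 14, 13, 13, 13, 14, 13, 13),
  st      = t0 + c(2, 1, 0, 86401, 3, 0, 86400, 1),
  section = paste0("section", c("C", "D", "A", "B", "E", "D", "E", "A")),
  stringsAsFactors = FALSE
)
df$day <- format(df$st, "%m/%d", tz = "UTC")

# sort once so each id/day group is contiguous and chronological
df <- df[order(df$id, df$day, df$st), ]
n <- nrow(df)

# TRUE where a row has a predecessor within the same id and day
same_group <- c(FALSE, df$id[-1] == df$id[-n] & df$day[-1] == df$day[-n])
idx <- which(same_group)

# look up (previous section, current section) for those rows only
df$distance <- 0
df$distance[idx] <- dmatrix[cbind(df$section[idx - 1], df$section[idx])]

# ids as rows, days as columns, NA where an id has no records that day
totals <- tapply(df$distance, list(id = df$id, day = df$day), sum)
totals
```

Here staff 13 walks A, A, C, E on 06/10 (2 + 3 = 5) and E, B on 06/11 (4), while staff 14 stays in D on 06/10 (0).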

Upvotes: 0
