TheNaidge

Reputation: 89

R calculating time differences in a (layered) long dataset

I've been struggling with a bit of timestamp data (haven't had to work with dates much until now, and it shows). Hope you can help out.

I'm working with data from a website showing for each customer (ID) their respective visits and the timestamp for those visits. It's grouped in the sense that one customer might have multiple visits/timestamps.

The df is structured as follows, in a long format:

 df <- data.frame("Customer" = c(1, 1, 1, 2, 3, 3),  
                  "Visit" =c(1, 2, 3, 1, 1, 2), # e.g. customer ID #1 has visited the site three times.
                  "Timestamp" = c("2019-12-31 12:13:25", "2019-12-31 16:13:25", "2020-01-05 10:13:25", "2019-11-12 15:18:42", "2019-11-13 19:22:35", "2019-12-10 19:43:55"))

Note: In the real dataset the timestamp isn't a factor but some other haggard character-type abomination which I should probably first try to convert into a POSIXct format somehow.

What I would like to do here is to create a df that displays per customer their average time between visits (let's say in minutes, or hours). Visitors with only a single visit (e.g., second customer in my example) could be filtered out in advance or should display a 0. My final goal is to visualize that distribution, and possibly calculate a grand mean across all customers.

Because the number of visits can vary drastically (e.g. one or 256 visits) I can't just use a 'wide' version of the dataset where a fixed number of visits are the columns which I could then subtract and average.

I'm at a bit of a loss how to best approach this type of problem, thanks a bunch!

Upvotes: 2

Views: 598

Answers (2)

Jason

Reputation: 2597

Using base R (no extra packages):

  1. Sort the data by customer ID, then by timestamp.
  2. Calculate the time difference between consecutive rows with diff(), grouped by customer ID (tapply() does the grouping).
  3. Find the average per customer.
  4. Squish that into a data.frame.

# 1 sort the data
df$Timestamp <- as.POSIXct(df$Timestamp)
df <- df[order(df$Customer, df$Timestamp), ]

# 2 apply a diff per customer.
# diff() on a POSIXct picks its units automatically (secs, mins, hours, ...),
# so different customers can come back in different units; converting
# the timestamp to numeric first forces the differences into seconds.

# without conversion
diffs <- tapply(df$Timestamp, df$Customer, diff)
# ======OR======
# with conversion to seconds
diffs <- tapply(as.numeric(df$Timestamp), df$Customer, diff)

# 3 find the averages (customers with a single visit come back as NaN)
diffs.mean <- lapply(diffs, mean)

# 4 squish that into a data.frame
diffs.df <- data.frame(do.call(rbind, diffs.mean))
diffs.df$Customer <- names(diffs.mean)

# 4a tidy up the data.frame names
names(diffs.df)[1] <- "Avg_Interval"
diffs.df

You haven't shown your timestamp strings, but when you need to wrangle them, the lubridate package is your friend.
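For instance, lubridate's parse_date_time() can cope with mixed character formats. A quick sketch (the sample strings and the `orders` vector here are my own assumptions, since you haven't shown your actual timestamps):

```r
library(lubridate)

x <- c("2019-12-31 12:13:25", "31/12/2019 16:13:25")
# parse_date_time() tries each format in `orders` until one matches
parse_date_time(x, orders = c("Ymd HMS", "dmY HMS"))
```

Once everything is POSIXct, the diff/tapply steps above work unchanged.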

Upvotes: 2

Austin Graves

Reputation: 1052

Using dplyr:

library(dplyr)

# assumes Timestamp has already been converted to POSIXct
df %>%
  arrange(Customer, Timestamp) %>%
  group_by(Customer) %>%
  mutate(Difference = Timestamp - lag(Timestamp)) %>%
  summarise(Avg_Interval = mean(Difference, na.rm = TRUE))

Because of the grouping, the first Difference value for each customer is NA (including customers with only one visit), so those rows are dropped by na.rm = TRUE when computing the mean.
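To pin the averages to a fixed unit and get the grand mean you mentioned, you can convert the difftime explicitly. A sketch building on the pipeline above (`Avg_Interval` and `grand_mean` are names I've chosen):

```r
library(dplyr)

df <- data.frame(Customer = c(1, 1, 1, 2, 3, 3),
                 Visit = c(1, 2, 3, 1, 1, 2),
                 Timestamp = c("2019-12-31 12:13:25", "2019-12-31 16:13:25",
                               "2020-01-05 10:13:25", "2019-11-12 15:18:42",
                               "2019-11-13 19:22:35", "2019-12-10 19:43:55"))
df$Timestamp <- as.POSIXct(df$Timestamp)

avg <- df %>%
  arrange(Customer, Timestamp) %>%
  group_by(Customer) %>%
  # as.numeric(..., units = "mins") forces every difference into minutes
  mutate(Difference = as.numeric(Timestamp - lag(Timestamp), units = "mins")) %>%
  summarise(Avg_Interval = mean(Difference, na.rm = TRUE))

# single-visit customers come out as NaN and are skipped by na.rm = TRUE
grand_mean <- mean(avg$Avg_Interval, na.rm = TRUE)
```

The `avg` data frame is then ready for plotting the distribution (e.g. with hist() or ggplot2).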

Upvotes: 3
