Manipulating Data in R

Question

I have data a data frame in the following structure

transaction | customer | week | amount
12551       | ieeamo   | 32   | €23.54
12553       | ieeamo   | 33   | €17.00

I would like to get it in the following structure (for all weeks)

week | customer | activity last week | activity 2 weeks ago
32   | ieeamo   | €0.00              | €0.00
33   | ieeamo   | €23.54             | €0.00
34   | ieeamo   | €17.00             | €23.54             
35   | ieeamo   | €0.00              | €17.00

Essentially, I am trying to convert transactional data to relative data.

My thoughts are that the best way to do this is to use loops to generate many dataframes then rbind them all at the end. However this approach does not seem efficient, and i'm not sure it will scale to the data I am using.

Is there a more proper solution?

Dinre · Accepted Answer

Rbinding is a bad idea for this, since each rbind creates a new copy of the data frame in memory. We can get to the answer more quickly with a mostly vectorized approach, using loops only to make code more concise. Props to the OP for recognizing the inefficiency and searching for a solution.

Note: The following solution will work for any number of customers, but would require minor modification to work with more lag columns.

Setup: First we need to generate some data to work with. I'm going to use two different customers with a few weeks of transactional data each, like so:

data <- read.table(text="
    transaction customer week amount
    12551       cOne     32   1.32
    12552       cOne     34   1.34
    12553       cTwo     34   2.34
    12554       cTwo     35   2.35
    12555       cOne     36   1.36
    12556       cTwo     37   1.37
    ", header=TRUE)

Step 1: Calculate some variables and initialize new data frame. To make the programming really easy, we first want to know two things: how many customers and how many weeks? We calculate those answers like so:

customer_list <- unique(data$customer)
# cOne cTwo

week_span <- min(data$week):max(data$week)
# 32 33 34 35 36 37

Next, we need to initialize the new data frame based on the variables we just calculated. In this new data frame, we need an entry for every week, not just the weeks in the data. This is where our 'week_span' variable comes in useful.

new_data <- data.frame(
    week=sort(rep(week_span,length(customer_list))),
    customer=customer_list,
    activity_last_week=NA,
    activity_2_weeks_ago=NA)

#    week customer activity_last_week activity_2_weeks_ago
# 1    32     cOne                 NA                   NA
# 2    32     cTwo                 NA                   NA
# 3    33     cOne                 NA                   NA
# 4    33     cTwo                 NA                   NA
# 5    34     cOne                 NA                   NA
# 6    34     cTwo                 NA                   NA
# 7    35     cOne                 NA                   NA
# 8    35     cTwo                 NA                   NA
# 9    36     cOne                 NA                   NA
# 10   36     cTwo                 NA                   NA
# 11   37     cOne                 NA                   NA
# 12   37     cTwo                 NA                   NA

You'll notice we repeat the week list for each customer and sort it, so we get a list resembling 1,1,2,2,3,3,4,4...n,n with a number of repetitions equal to the number of customers in the data. This makes it so we can specify the 'customer' data as just the list of customers, since the list will repeat to fill up the space. The lag columns are left as NA for now.

Step 2: Fill in the lag values. Now, things are pretty simple. We just need to grab the subset of rows for each customer and find out if there were any transactions for each week. We do this by using the 'match' function to pull out values for every week. Where data does not exist, we'll get an NA value and need to replace those with zeros (assuming no activity means a zero transaction). Then, for the lag columns, we just offset the values with NA depending on the number of weeks we are lagging.

# Loop through the customers.
for (i in 1:length(customer_list)){
    # Select the next customer's data.
    subset <- data[data$customer==customer_list[i],]

    # Extract the data values for each week.
    subset_amounts <- subset$amount[match(week_span, subset$week)]

    # Replace NA with zero.
    subset_amounts <- ifelse(is.na(subset_amounts),0,subset_amounts)

    # Loop through the lag columns.   
    for (lag in 1:2){

        # Write in the data values with the appropriate
        # number of offsets according to the lag.
        # Truncate the extra values.
        new_data[new_data$customer==customer_list[i], (2+lag)] <- c(rep(NA,lag), subset_amounts[1:(length(subset_amounts)-lag)])
    }
}

#    week customer activity_last_week activity_2_weeks_ago
# 1    32     cOne                 NA                   NA
# 2    32     cTwo                 NA                   NA
# 3    33     cOne               1.32                   NA
# 4    33     cTwo               0.00                   NA
# 5    34     cOne               0.00                 1.32
# 6    34     cTwo               0.00                 0.00
# 7    35     cOne               1.34                 0.00
# 8    35     cTwo               2.34                 0.00
# 9    36     cOne               0.00                 1.34
# 10   36     cTwo               2.35                 2.34
# 11   37     cOne               1.36                 0.00
# 12   37     cTwo               0.00                 2.35

In other situations... If you have a series of ordered time data where no rows are missing, this sort of task becomes incredibly simple with the 'embed' function. Let's say we have some data that looks like this:

data <- data.frame(week=1:20, value=1:20+(1:20/100))

#    week value
# 1     1  1.01
# 2     2  2.02
# 3     3  3.03
# 4     4  4.04
# 5     5  5.05
# 6     6  6.06
# 7     7  7.07
# 8     8  8.08
# 9     9  9.09
# 10   10 10.10
# 11   11 11.11
# 12   12 12.12
# 13   13 13.13
# 14   14 14.14
# 15   15 15.15
# 16   16 16.16
# 17   17 17.17
# 18   18 18.18
# 19   19 19.19
# 20   20 20.20

We could make a lagged data set in no time, like so:

new_data <- data.frame(week=data$week[3:20], embed(data$value,3))

names(new_data)[2:4] <- c("this_week", "last_week", "2_weeks_ago")

#    week this_week last_week 2_weeks_ago
# 1     3      3.03      2.02        1.01
# 2     4      4.04      3.03        2.02
# 3     5      5.05      4.04        3.03
# 4     6      6.06      5.05        4.04
# 5     7      7.07      6.06        5.05
# 6     8      8.08      7.07        6.06
# 7     9      9.09      8.08        7.07
# 8    10     10.10      9.09        8.08
# 9    11     11.11     10.10        9.09
# 10   12     12.12     11.11       10.10
# 11   13     13.13     12.12       11.11
# 12   14     14.14     13.13       12.12
# 13   15     15.15     14.14       13.13
# 14   16     16.16     15.15       14.14
# 15   17     17.17     16.16       15.15
# 16   18     18.18     17.17       16.16
# 17   19     19.19     18.18       17.17
# 18   20     20.20     19.19       18.18

Manipulating Data in R

Answers (1)

Related Questions