Reputation: 353

Process rows of R dataframe without loop in memory efficient way

The structure of my dataframe data1, which has over 1.5 million rows, is like this:

data1 <- data.frame(NEW_UPC=c(11820005991,11820005991,11820005991,11820005991,11820005991,11820005991,11820005991,11820005991,11820005991,11820005991,11820005991,11820005991,11820005992,11820005992,11820005992,11820005992,11820005992,11820005992,11820005992,11820005992,11820005992,11820005993,11820005993,11820005993,11820005993,11820005993,11820005993,11820005993,11820005993,11820005993,11820005994,11820005994,11820005994,11820005994,11820005994,11820005994,11820005995,11820005995,11820005995,11820005995,11820005995,11820005995,11820005995,11820005995,11820005995),
                IRI_KEY=c(1073521,1073521,1073521,1073525,1073525,1073525,1078106,1078106,1078106,1078107,1078107,1078107,1073521,1073521,1073521,1073525,1073525,1073525,1078106,1078106,1078106,1073521,1073521,1073521,1073525,1073525,1073525,1078106,1078106,1078106,1073521,1073521,1073525,1073525,1078106,1078106,1073521,1073521,1073521,1073525,1073525,1073525,1078106,1078106,1078106),
                WEEK = c(1229,1230,1232,1218,1224,1229,1282,1285,1287,1229,1230,1232,1229,1230,1232,1218,1224,1229,1282,1285,1287,1229,1230,1232,1217,1221,1227,1270,1272,1273,1273,1274,1270,1272,1217,1221,1229,1230,1232,1218,1224,1229,1282,1285,1287),
                END=c(1232,1232,1232,1229,1229,1229,1287,1287,1287,1232,1232,1232,1232,1232,1232,1229,1229,1229,1287,1287,1287,1232,1232,1232,1227,1227,1227,1273,1273,1273,1274,1274,1272,1272,1221,1221,1232,1232,1232,1229,1229,1229,1287,1287,1287))

I need to add a column Exit.time computed from the columns WEEK and END and a cutoff value of 1287. Exit.time should be 0 or 1 according to the following logic:

If WEEK = 1287, then Exit.time = 0.

If WEEK is not equal to 1287 but WEEK = END, then Exit.time = 1; otherwise Exit.time = 0.

For this I tried the following for loop and it does what is required in the above dummy data set.

# Row-by-row version: correct, but very slow on 1.5 million rows
for (i in seq_len(nrow(data1))) {
  if (data1$WEEK[i] == 1287) {
    data1$Exit.time[i] <- 0
  } else if (data1$WEEK[i] == data1$END[i]) {
    data1$Exit.time[i] <- 1
  } else {
    data1$Exit.time[i] <- 0
  }
}

The problem is that on my real data set the loop has not produced any output even after an hour. I guess looping is not efficient given the size of the dataset. Is there an alternative way to do what I want? I would like to preserve the order of rows in data1, since I need to do some merge operations later on.

Upvotes: 0

Answers (3)

IceCreamToucan

Reputation: 28695

Since you need Exit.time to be 1 when (WEEK == END) & WEEK != 1287 and 0 otherwise, you can use as.numeric on the results of (WEEK == END) & WEEK != 1287, which changes TRUE to 1 and FALSE to 0.

data1$Exit.time <- with(data1, as.numeric(WEEK != 1287 & WEEK == END))
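As a quick check, here is a minimal sketch on a small data frame whose values are invented for illustration (they are not from the question). It shows the logical comparison being evaluated over whole columns at once and coerced to 0/1, with rows left in their original order.

```r
# Toy data, invented for illustration only
toy <- data.frame(
  WEEK = c(1229, 1232, 1287, 1285),
  END  = c(1232, 1232, 1287, 1287)
)

# One vectorized comparison over the whole columns, no loop
toy$Exit.time <- with(toy, as.numeric(WEEK != 1287 & WEEK == END))

print(toy$Exit.time)
# [1] 0 1 0 0
```

Only the second row has WEEK equal to END without being at the cutoff, so it is the only row that gets 1.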

Upvotes: 4

sm925

Reputation: 2678

Using data.table:

library(data.table)
# setDT() converts data1 to a data.table in place; := adds the column by reference
setDT(data1)[, Exit.time := ifelse(WEEK == 1287, 0, ifelse(WEEK == END, 1, 0))]
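Because both branches reduce to a single logical test, the nested ifelse is not strictly needed. A possible sketch, assuming the data.table package is installed and using toy values invented for illustration (the names dt and cutoff are hypothetical, not from the question):

```r
library(data.table)

# Toy data, invented for illustration only
dt <- data.table(
  WEEK = c(1229, 1232, 1287, 1285),
  END  = c(1232, 1232, 1287, 1287)
)

cutoff <- 1287  # cutoff value stated in the question

# := adds the column by reference, so the table is not copied
dt[, Exit.time := as.integer(WEEK != cutoff & WEEK == END)]

print(dt$Exit.time)
# [1] 0 1 0 0
```

Adding the column by reference avoids making a second copy of a 1.5-million-row table, and the row order is left untouched.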

Upvotes: 0

Melissa Key

Reputation: 4551

There are multiple ways to code this. They differ mostly in syntax; fundamentally, they all do the same thing.

Base R:

data1$Exit.time <- (data1$WEEK != 1287 & data1$WEEK == data1$END)*1

This involves typing data1 a lot, so there is a short-cut:

data1 <- within(data1, {
  Exit.time <- (WEEK != 1287 & WEEK == END)*1
})

Tidyverse: The tidyverse is a suite of packages that are great at manipulating data. This example uses the package dplyr, which is part of the tidyverse, so you can load either the whole suite or just dplyr:

library(tidyverse)
data1 <- data1 %>%
   mutate(
     Exit.time = (WEEK != 1287 & WEEK == END)*1
   )

(I convert from TRUE/FALSE to 0/1 by multiplying by 1; it's less to type.)
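To check that the vectorized expression matches the row-by-row logic from the question, one option is to run both on a small data frame and compare the results. This is only a sketch; the toy values and the names loop_result and vec_result are invented for illustration.

```r
# Toy data, invented for illustration only
toy <- data.frame(
  WEEK = c(1229, 1232, 1287, 1285),
  END  = c(1232, 1232, 1287, 1287)
)

# Row-by-row logic, as in the question's loop
loop_result <- numeric(nrow(toy))
for (i in seq_len(nrow(toy))) {
  if (toy$WEEK[i] == 1287) {
    loop_result[i] <- 0
  } else if (toy$WEEK[i] == toy$END[i]) {
    loop_result[i] <- 1
  } else {
    loop_result[i] <- 0
  }
}

# Vectorized form: logical vector times 1 gives a numeric 0/1 vector
vec_result <- (toy$WEEK != 1287 & toy$WEEK == toy$END) * 1

print(identical(loop_result, vec_result))
# [1] TRUE
```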

Upvotes: 3
