Chris
Chris

Reputation: 1237

Identify the difference between rows in data.table and create new column saying what the differences are

I have a dataset that looks likes this:

Data01 <- data.table(
  code=c("A111", "A111","A111","A111","A111", "A111","A111","A234", "A234","A234","A234","A234", "A234","A234"),
  x=c("",126,126,"",836,843,843,126,126,"",127,836,843,843), 
  y=c("",76,76,"",456,465,465,76,76,"",77,456,465,465),
  no1=c(028756, 028756,028756,057756, 057756, 057756, 057756,028756, 028756,057756,057756, 057756, 057756, 057756),
  no2=c("","",034756,"","","",789165,"",034756,"","","","",789165)
)

Data01[, version := paste0("V", 1:.N), by = code]
Data01[, unique_version := paste(code, version, sep = "_")]

What I would like is a way of adding a column that for each unique code entry says what the difference between each row and the previous one is (i.e. pasting the column name(s) where there are now different values). Something like this:

Data01[, change := c("First_entry","New_x_and_y","New_no2","New_x_and_y_and_no_1","New_x_and_y","New_x_and_y","New_no2","First_entry","New_no2","New_x_and_y_and_no1","New_x_and_y","New_x_and_y","New_x_and_y","New_no2")]

My actual dataset has 5.5 million rows and around 2.6 million unique code entries so I would imagine any solution to this would take some time to complete. It would therefore really helpful to include some sort of progress indicator (like something suggested here: Progress bar in data.table aggregate action) if possible.

Upvotes: 1

Views: 159

Answers (1)

chinsoon12
chinsoon12

Reputation: 25225

You can try something like this

nm <- c("x","y","no1","no2") #names(Data01)[-1L]
Data01[, change := c("First_entry", 
        sapply(seq_len(.N)[-1L], function(n) {
            paste(c("New", 
                nm[which(unlist(.SD[n-1L]) != unlist(.SD[n]))]), 
                collapse="_")
        })), 
    by=.(code)]

Upvotes: 1

Related Questions