drbv

Reputation: 369

How to get a numbered list renumbering when a value changes

I have two columns (col1 and col2), shown below, and I'd like to add two more (col3 and col4). col3 numbers the rows of each col2 value, restarting at 1 every time col2 changes (e.g. from b2 to b3). col4 is TRUE on the last occurrence of each value in col2.

The data is sorted by col1, then by col2. Note that values in col2 can occur for different values of col1 (i.e. I can have b1 for every value of col1: a, b, c).

I can get this working fine for ~5000 rows (~6 sec), but scaling to ~1 million rows it hangs up.

Here is my code

df$col3 <- 0
df$col4 <- FALSE
stopHere <- nrow(df)
c1 <- 'xxx'
c2 <- 'xxx'
for (i in 1:stopHere) {
  if (df[i, "col1"] != c1) {
    c2 <- 0
    c3 <- 1
    c1 <- df[i, "col1"]
  }
  if (df[i, "col2"] != c2) {
    df[i - 1, "col4"] <- TRUE
    c3 <- 1
    c2  <- df[i, "col2"]
  }
  df[i, "col3"] <- c3
  c3  <- c3 + 1
}

This is my desired output.

   col1 col2 col3  col4
1     a   b1    1 FALSE
2     a   b1    2 FALSE
3     a   b1    3  TRUE
4     a   b2    1 FALSE
5     a   b2    2  TRUE
6     a   b3    1 FALSE
7     a   b3    2 FALSE
8     a   b3    3 FALSE
9     a   b3    4 FALSE
10    a   b3    5  TRUE
11    b   b1    1 FALSE
12    b   b1    2 FALSE
13    b   b1    3 FALSE
14    b   b1    4  TRUE
15    b   b2    1 FALSE
16    b   b2    2 FALSE
17    b   b2    3 FALSE
18    b   b2    4  TRUE
19    c   b1    1  TRUE
20    c   b2    1 FALSE
21    c   b2    2 FALSE
22    c   b2    3  TRUE
23    c   b3    1 FALSE
24    c   b3    2  TRUE
25    c   b4    1 FALSE
26    c   b4    2 FALSE
27    c   b4    3 FALSE
28    c   b4    4 FALSE

Upvotes: 9

Views: 763

Answers (4)

Tomas

Reputation: 59475

This solution needs no loops, no rle, and no other clever functions, just merge and aggregate.

First, prepare your data (using Andrie's code):

df <- data.frame(
  x = rep(letters[1:3], c(10, 8, 10)),
  y = rep(paste("b", c(1:3, 1:2, 1:4), sep = ""), c(3, 2, 5, 4, 4, 1, 3, 2, 4))
)

The solution:

minmax <- with(df, merge(
                aggregate(seq(x), by = list(x = x, y = y), min),
                aggregate(seq(x), by = list(x = x, y = y), max)
          ))

names(minmax)[3:4] = c("min", "max") # unique pairs with min/max global order

result <- with(merge(df, minmax), 
    data.frame(x, y, count = seq(x) - min + 1, last = seq(x) == max))

This solution assumes that the input is sorted as you said, but can be easily modified to work on unsorted tables (and keep them unsorted).
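
As a rough sketch of how the unsorted case might look, here is one option using base R's ave rather than the merge/aggregate approach above (so it is a swapped-in technique, not the code from this answer). The tiny data frame is made up for illustration; only the column names x and y follow the answer.

```r
# Hypothetical unsorted input (invented for this sketch).
df <- data.frame(
  x = c("a", "b", "a", "a", "b"),
  y = c("b1", "b1", "b2", "b1", "b1"),
  stringsAsFactors = FALSE
)

idx <- seq_len(nrow(df))

# Running count within each (x, y) pair, keeping the original row order.
df$count <- ave(idx, df$x, df$y, FUN = seq_along)

# Size of each (x, y) group, repeated on every row of that group.
size <- ave(idx, df$x, df$y, FUN = length)

# A row is the last of its group when its running count equals the group size.
df$last <- df$count == size
```

Because ave writes each group's result back to the rows it came from, the table stays in its original order. Note that on unsorted data this counts all rows of a pair together, even if they are not adjacent.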

Upvotes: 1

John Colby

Reputation: 22588

Some example data would be helpful. Nevertheless, this should be a good place to start. With 3 unique values in col1, and 4 in col2, it only takes a second for 10^6 rows:

n = 10^6

col1 = sample(c('a', 'b', 'c'), n, replace=TRUE)
col2 = sample(paste('b', 1:4, sep=''), n, replace=TRUE)

data = data.frame(col1, col2, col3=0, col4=FALSE)
data = data[do.call(order, data), ]

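# Number the rows within each (col1, col2) group; relies on the sort order above.
# seq_along() avoids coercing col2 to numeric, which warns when col2 is character.
data$col3 = unlist(t(tapply(seq_along(data$col2), data[,1:2], function(x) seq_along(x))))
# col3 stops increasing wherever a group ends; <= 0 also catches one-row groups
data$col4[c(diff(data$col3), -1) <= 0] = TRUE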
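
To make the diff step a bit more concrete, here is a small illustration on a hand-written counter vector (the values are invented, not taken from the simulated data above). Within a group the counter rises by 1, so a difference of zero or less marks a group boundary; the appended -1 flags the final row. Comparing with <= 0 rather than < 0 matters when a one-row group is followed by another group, since the counter then goes from 1 to 1.

```r
# Hypothetical per-group counter: groups of size 3, 2, 1 and 2.
col3 <- c(1, 2, 3, 1, 2, 1, 1, 2)

# Difference to the next row; -1 is appended so the last row counts as a boundary.
step <- c(diff(col3), -1)

# TRUE on the last row of each group, including the one-row group at position 6.
is_last <- step <= 0
is_last
```

Whether the very last row should be TRUE depends on which version of the requirement you follow; dropping the appended -1 in favour of 1 would leave it FALSE.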

Upvotes: 6

Brian Diggs

Reputation: 58825

First, make your starting data reproducible, and make col1 and col2 columns in a dataframe.

dat <- read.table(textConnection(
"a   b1
a   b1
a   b1
a   b2
a   b2
a   b3
a   b3
a   b3
a   b3
a   b3
b   b1
b   b1
b   b1
b   b1
b   b2
b   b2
b   b2
b   b2
c   b1
c   b2
c   b2
c   b2
c   b3
c   b3
c   b4
c   b4
c   b4
c   b4"), stringsAsFactors=FALSE)
names(dat) <- c("col1", "col2")

Run length encoding gives the lengths of your sequences, since everything is starting out sorted.

runs <- rle(dat$col2)

Now manipulate that info. For each element in the lengths component, create a sequence of that length and put them all together. The indices of the TRUE values for col4 can be obtained from the cumsum of the lengths.

# lapply (not sapply) so that equal-length runs are not simplified into a matrix
dat$col3 <- unlist(lapply(runs$lengths, function(l) seq(length.out=l)))
dat$col4 <- FALSE
dat$col4[cumsum(runs$lengths)] <- TRUE
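
If it helps to see the pieces separately, here is a minimal sketch on a short made-up vector (the vector v and the variable names are only for illustration) showing what rle returns and how cumsum turns run lengths into end positions.

```r
# Hypothetical sorted input with runs of length 3, 2 and 1.
v <- c("b1", "b1", "b1", "b2", "b2", "b3")

runs <- rle(v)
runs$lengths   # length of each run
runs$values    # the value repeated in each run

# One counter per run, concatenated: restarts at 1 at the start of every run.
counter <- unlist(lapply(runs$lengths, seq_len))

# Cumulative run lengths are the positions where each run ends.
ends <- cumsum(runs$lengths)
```

Here counter comes out as 1 2 3 1 2 1 and ends as 3 5 6, which are the rows that would be flagged TRUE.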

For the result:

> dat
   col1 col2 col3  col4
1     a   b1    1 FALSE
2     a   b1    2 FALSE
3     a   b1    3  TRUE
4     a   b2    1 FALSE
5     a   b2    2  TRUE
6     a   b3    1 FALSE
7     a   b3    2 FALSE
8     a   b3    3 FALSE
9     a   b3    4 FALSE
10    a   b3    5  TRUE
11    b   b1    1 FALSE
12    b   b1    2 FALSE
13    b   b1    3 FALSE
14    b   b1    4  TRUE
15    b   b2    1 FALSE
16    b   b2    2 FALSE
17    b   b2    3 FALSE
18    b   b2    4  TRUE
19    c   b1    1  TRUE
20    c   b2    1 FALSE
21    c   b2    2 FALSE
22    c   b2    3  TRUE
23    c   b3    1 FALSE
24    c   b3    2  TRUE
25    c   b4    1 FALSE
26    c   b4    2 FALSE
27    c   b4    3 FALSE
28    c   b4    4  TRUE

Note that the last line has col4 TRUE, which matches your written description (last of a set is TRUE), but does not match your example output. I don't know which you want.

Upvotes: 3

Andrie

Reputation: 179418

Here is a vectorized solution that works for your sample data:

dat <- data.frame(
  V1 = rep(letters[1:3], c(10, 8, 10)),
  V2 = rep(paste("b", c(1:3, 1:2, 1:4), sep = ""), c(3, 2, 5, 4, 4, 1, 3, 2, 4))
  )

Create columns 3 and 4:

zz <- rle(as.character(dat$V2))$lengths
dat$V3 <- sequence(zz)
dat$V4 <- FALSE
dat$V4[head(cumsum(zz), -1)] <- TRUE

The results:

dat
   V1 V2 V3    V4
1   a b1  1 FALSE
2   a b1  2 FALSE
3   a b1  3  TRUE
4   a b2  1 FALSE
5   a b2  2  TRUE
6   a b3  1 FALSE
7   a b3  2 FALSE
8   a b3  3 FALSE
9   a b3  4 FALSE
10  a b3  5  TRUE
11  b b1  1 FALSE
12  b b1  2 FALSE
13  b b1  3 FALSE
14  b b1  4  TRUE
15  b b2  1 FALSE
16  b b2  2 FALSE
17  b b2  3 FALSE
18  b b2  4  TRUE
19  c b1  1  TRUE
20  c b2  1 FALSE
21  c b2  2 FALSE
22  c b2  3  TRUE
23  c b3  1 FALSE
24  c b3  2  TRUE
25  c b4  1 FALSE
26  c b4  2 FALSE
27  c b4  3 FALSE
28  c b4  4 FALSE
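
One caveat worth hedging: rle on V2 alone only works when the V2 value actually changes at every V1 boundary, as it does in the sample data. If one V1 group ends with the same V2 value that the next group starts with, the two runs merge. A possible workaround, sketched below on made-up data, is to run rle on a combined key. The "|" separator is an assumption and should be something that never appears in the data.

```r
# Hypothetical data where group "a" ends with b1 and group "b" starts with b1.
dat <- data.frame(
  V1 = c("a", "a", "b", "b", "b"),
  V2 = c("b1", "b1", "b1", "b1", "b2"),
  stringsAsFactors = FALSE
)

# rle on V2 alone sees one run of four b1 rows, which is not what we want.
merged <- rle(dat$V2)$lengths

# Combined key: a new run starts whenever V1 or V2 changes.
key <- paste(dat$V1, dat$V2, sep = "|")
zz <- rle(key)$lengths

dat$V3 <- sequence(zz)
dat$V4 <- FALSE
# Flags the final row too; use head(cumsum(zz), -1) to leave it FALSE instead.
dat$V4[cumsum(zz)] <- TRUE
```

With the combined key the runs have lengths 2, 2 and 1, so V3 restarts at the a/b boundary as intended.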

Upvotes: 9
