Reputation: 67
I wish to extract "pair of numbers", i.e. numbers in adjacent columns within the same row. Then I want to count the pairs to determine which are most frequent.
As example I created a dataset with 5 columns and 4 rows:
var1 var2 var3 var4 var5
1 2 3 0 11
2 0 3 0 1
3 0 3 1 2
4 1 2 2 11
The most frequent consecutive pairs of number are:
1 -> 2
: 3 times (row 1, var1 -> var2; row 3, var4 -> var5; row 4, var2 -> var3)
3 -> 0
: 3 times (row 1, var3 -> var4; row 2, var3 -> var4; row 3, var1 -> var2)
0 -> 3
: 2 times
I am struggling with the code that identifies the most frequent 'consecutive pair of numbers'?
How can I replace the identified consecutive pair of number with 2 and the others with 0?
Upvotes: 1
Views: 546
Reputation: 67778
A base
alternative.
1. Find and count pairs
Because you only have numerical values, we coerce data to matrix. This will make subsequent calculations considerably faster. Create lag and lead versions (column-wise) of the data, i.e. remove the last column (m[ , -ncol(m)]
) and first column (m[ , -ncol(m)]
) respectively.
Coerce the lag and lead data to 'from' and 'to' vectors, and count pairs (table
). Convert table to matrix. Select first pair with max frequency.
m <- as.matrix(d)
tt <- table(from = as.vector(m[ , -ncol(m)]), to = as.vector(m[ , -1]))
m2 <- cbind(from = as.integer(dimnames(tt)[[1]]),
to = rep(as.integer(dimnames(tt)[[2]]), each = dim(tt)[1]),
freq = as.vector(tt))
m3 <- m2[which.max(m2[ , "freq"]), ]
# from to freq
# 3 0 3
If you want all pairs with maximum frequency, use m2[m2[ , "freq"] == max(m2[ , "freq"]), ]
instead.
2. Replace values of most frequent pair and set rest to zero
Make a copy of the original data. Fill it with zero. Grab the 'from' and 'to' values of the 'max pair'. Get indexes of matches in lag and lead data, which correspond to 'from' columns. rbind
with indexes of 'to' columns. At the indexes, replace zeros with 2.
m_bin <- m
m_bin[] <- 0
ix <- which(m[ , -ncol(m)] == m3["from"] &
m[ , -1] == m3["to"],
arr.ind = TRUE)
m_bin[rbind(ix, cbind(ix[ , "row"], ix[ , "col"] + 1))] <- 2
m_bin
# var1 var2 var3 var4 var5
# [1,] 0 0 2 2 0
# [2,] 0 0 2 2 0
# [3,] 2 2 0 0 0
# [4,] 0 0 0 0 0
3. Benchmark
I use data of a size somewhat similar to that mentioned by OP in comment: a data frame with 10000 rows, 100 columns, and sampling from 100 different values.
I compare the code above (f_m()
) with the zoo
answer (f_zoo()
; functions below). To compare the output, I add dimnames
to the zoo
result.
The result shows that f_m
is considerably faster.
set.seed(1)
nr <- 10000
nc <- 100
d <- as.data.frame(matrix(sample(1:100, nr * nc, replace = TRUE),
nrow = nr, ncol = nc))
res_f_m <- f_m(d)
res_f_zoo <- f_zoo(d)
dimnames(res_f_zoo) <- dimnames(res_f_m)
all.equal(res_f_m, res_f_zoo)
# [1] TRUE
system.time(f_m(d))
# user system elapsed
# 0.12 0.01 0.14
system.time(f_zoo(d))
# user system elapsed
# 61.58 26.72 88.45
f_m <- function(d){
m <- as.matrix(d)
tt <- table(from = as.vector(m[ , -ncol(m)]),
to = as.vector(m[ , -1]))
m2 <- cbind(from = as.integer(dimnames(tt)[[1]]),
to = rep(as.integer(dimnames(tt)[[2]]),
each = dim(tt)[1]),
freq = as.vector(tt))
m3 <- m2[which.max(m2[ , "freq"]), ]
m_bin <- m
m_bin[] <- 0
ix <- which(m[ , -ncol(m)] == m3["from"] &
m[ , -1] == m3["to"],
arr.ind = TRUE)
m_bin[rbind(ix, cbind(ix[ , "row"], ix[ , "col"] + 1))] <- 2
return(m_bin)
}
f_zoo <- function(d){
pairs <- sort(table(c(rollapply(t(d), 2, toString))))
top <- scan(text = names(tail(pairs, 1)), sep = ",", what = 0L, quiet = TRUE)
right <- rollapplyr(unname(t(d)), 2, identical, top, fill = FALSE)
left <- rbind(right[-1, ], FALSE)
t(2 * (left | right))
}
Upvotes: 0
Reputation: 269471
library(zoo)
pairs <- sort(table(c(rollapply(t(DF), 2, toString))))
# all pairs with their frequency
pairs
## 0, 1 0, 11 2, 0 2, 11 2, 2 2, 3 3, 1 4, 1 0, 3 1, 2 3, 0
## 1 1 1 1 1 1 1 1 2 3 3
# same but as data.frame
data.frame(read.table(text = names(pairs), sep = ","), freq = c(pairs))
## V1 V2 freq
## 0, 1 0 1 1
## 0, 11 0 11 1
## ...
## 0, 3 0 3 2
## 1, 2 1 2 3
## 3, 0 3 0 3
# pair with highest frequency - picks one if there are several
tail(pairs, 1)
## 3, 0
## 3
# all pairs with highest frequency
pairs[pairs == max(pairs)]
## 1, 2 3, 0
## 3 3
To replace all 3,0 pairs with 2 and everything else with 0:
top <- scan(text = names(tail(pairs, 1)), sep = ",", what = 0L, quiet = TRUE)
right <- rollapplyr(unname(t(DF)), 2, identical, top, fill = FALSE)
left <- rbind(right[-1, ], FALSE)
t(2 * (left | right))
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 0 2 2 0
## [2,] 0 0 2 2 0
## [3,] 2 2 0 0 0
## [4,] 0 0 0 0 0
The input DF
in reproducible form is:
Lines <- "1 2 3 0 11
2 0 3 0 1
3 0 3 1 2
4 1 2 2 11"
DF <- read.table(text = Lines)
Upvotes: 1