Reputation: 600
I've inherited a database that schedules things in a way I'm not familiar with. I've figured out the following:
Monday = 1, Tuesday = 2, Wednesday = 4, Thursday = 8, Friday = 16, Saturday = 32, Sunday = 64
Easy enough. However, if an event is scheduled on Monday, Wednesday and Friday, the field shows 21 (i.e., M + W + F). It seems clever, but I'm stumped trying to figure out how to get back to "English" from this system. Given the number 21, how can I figure out what days an event is scheduled, programmatically?
In my head, I'd approach it like this: find the biggest power of two less than or equal to my number and subtract it (= first day), then the next biggest, and so on. So, given 21, the biggest power of two that fits is 16 (Friday), which leaves me 5. The next biggest is 4 (Wednesday), leaving me 1, which is Monday.
Is that approach correct? And if so, I see myself building an exceedingly complicated case_when switch, or maybe a convoluted for-loop, but I feel there's probably a simpler way.
I'm working in a mix of SQL Server (to extract the data) and R (to analyze it), so I could do this in either one. But even pseudocode would be helpful at this point.
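Here's a rough R sketch of the approach in my head (the function name is just something I made up), in case it clarifies what I'm after:

```r
# Greedy sketch: repeatedly subtract the largest day value that still fits.
decode_greedy <- function(n) {
  day_values <- c(Monday = 1, Tuesday = 2, Wednesday = 4, Thursday = 8,
                  Friday = 16, Saturday = 32, Sunday = 64)
  found <- character(0)
  for (v in rev(day_values)) {            # largest first: 64, 32, ..., 1
    if (n >= v) {
      found <- c(names(day_values)[day_values == v], found)
      n <- n - v
    }
  }
  found
}

decode_greedy(21)
# [1] "Monday"    "Wednesday" "Friday"
```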
Upvotes: 1
Views: 692
Reputation: 78832
Someone was trying to save space by using bit-field encoding, packing all the weekdays into a single byte. Apparently they wanted to show they were clever, or to trade CPU cycles for storage.
We can use the intToBits() function to take the numeric value and convert it to a bit array.
For example:
intToBits(1)
## [1] 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
## [24] 00 00 00 00 00 00 00 00 00
intToBits(4)
## [1] 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
## [24] 00 00 00 00 00 00 00 00 00
intToBits(5)
## [1] 01 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
## [24] 00 00 00 00 00 00 00 00 00
For some reason the PoweRs That Be™ chose to put things Least Significant Digit first (possibly due to taking LSD). It also has way too many bits for us, since we only need 7.
So we just need to rearrange and chomp some things when encoding and decoding:
decode_days <- function(x) {
  days <- c("Sunday", "Saturday", "Friday", "Thursday", "Wednesday", "Tuesday", "Monday")
  lapply(x, function(y) {
    # take the low 7 bits, reverse so the 64-bit lines up with "Sunday",
    # pick out the set days, then reverse again so the result reads Monday..Sunday
    rev(days[as.logical(rev(intToBits(y)[1:7]))])
  })
}
encode_days <- function(x) {
  days <- c(
    "sunday" = 64, "saturday" = 32, "friday" = 16, "thursday" = 8,
    "wednesday" = 4, "tuesday" = 2, "monday" = 1
  )
  sapply(x, function(y) {
    # normalize the input, drop anything that isn't a weekday name, sum the bit values
    y <- unique(tolower(trimws(y)))
    y <- y[y %in% names(days)]
    sum(days[y])
  })
}
Decoding in action:
decode_days(c(1,2,4,8,16,32,64,127,21))
## [[1]]
## [1] "Monday"
##
## [[2]]
## [1] "Tuesday"
##
## [[3]]
## [1] "Wednesday"
##
## [[4]]
## [1] "Thursday"
##
## [[5]]
## [1] "Friday"
##
## [[6]]
## [1] "Saturday"
##
## [[7]]
## [1] "Sunday"
##
## [[8]]
## [1] "Monday" "Tuesday" "Wednesday" "Thursday" "Friday" "Saturday"
## [7] "Sunday"
##
## [[9]]
## [1] "Monday" "Wednesday" "Friday"
Encoding in action:
encode_days(decode_days(c(1,2,4,8,16,32,64,127,21)))
## [1] 1 2 4 8 16 32 64 127 21
The encoder can be optimized a bit, but that's left as an exercise for the OP since I tried to implement it "in order" to make the translation more apparent.
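One possible micro-optimization, as a sketch (this is an illustration, not necessarily the best version): hoist the lookup vector out of the function so it isn't rebuilt on every call.

```r
# Sketch: build the name->value lookup once, outside the function.
days_enc <- c(
  "sunday" = 64, "saturday" = 32, "friday" = 16, "thursday" = 8,
  "wednesday" = 4, "tuesday" = 2, "monday" = 1
)

encode_days_2 <- function(x) {
  sapply(x, function(y) {
    y <- unique(tolower(trimws(y)))
    sum(days_enc[y[y %in% names(days_enc)]])
  })
}

encode_days_2(list(c("Monday", "Wednesday", "Friday")))
# [1] 21
```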
FWIW, a lookup table for encoding/decoding (as you suggested) is way faster than this method (just showing a partial example of decoding):
decode_lkp <- list(
  "1" = "Monday",
  "2" = "Tuesday",
  "3" = c("Monday", "Tuesday"),
  "4" = "Wednesday",
  "5" = c("Monday", "Wednesday"),
  "6" = c("Tuesday", "Wednesday"),
  "7" = c("Monday", "Tuesday", "Wednesday"),
  "8" = "Thursday"
  # you can do the rest
)
# moved this outside to make it a fair comparison
days_dec <- rev(c("Sunday", "Saturday", "Friday", "Thursday", "Wednesday", "Tuesday", "Monday"))
decode_days <- function(x) { # optimized version
  lapply(x, function(y) {
    days_dec[as.logical(intToBits(y)[1:7])]
  })
}
microbenchmark::microbenchmark(
  lookup = unname(decode_lkp[c(1:8)]),
  `ƒ()` = decode_days(1:8)
)
## Unit: microseconds
## expr min lq mean median uq max neval
## lookup 1.599 1.7635 2.13525 1.843 1.944 25.302 100
## ƒ() 12.126 12.8310 40.92872 13.084 13.447 2741.986 100
but I figured this would help show the "logic" behind your predecessor's attempt at cleverness, and the encoder has some bulletproofing in it.
For the "how" w/r/t bits/ints: a byte is 8 bits, but they're only using 7 here, so we'll stick with 7.
64 32 16 08 04 02 01
If we set all the bits to 0 except for 01:
64 32 16 08 04 02 01
0 0 0 0 0 0 1
We have that day of the week. If we set 04 and 01:
64 32 16 08 04 02 01
0 0 0 0 1 0 1
We have those two. Wherever there's a 1, we add the header numbers.
In other languages you'd use bitwise operators to test and set the bits. That's possible in R too (bitwAnd() and friends), but this approach is more straightforward for most use cases.
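A sketch of what that bitwise route could look like in base R, using bitwAnd() (an illustration, not the code above):

```r
# Decode by AND-ing against each day's bit value: a nonzero result
# means that day's bit is set.
days_vec <- c("Monday", "Tuesday", "Wednesday", "Thursday",
              "Friday", "Saturday", "Sunday")

decode_bitwand <- function(n) {
  days_vec[bitwAnd(n, 2^(0:6)) > 0]   # 2^(0:6) = 1, 2, 4, ..., 64
}

decode_bitwand(21)
# [1] "Monday"    "Wednesday" "Friday"
```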
Upvotes: 2
Reputation: 1321
A lookup-ish way:
library(rlist)

decode_days_setup <- function() {
  l <- c(1, 2, 4, 16, 32, 64)
  l <- c(1, 2, 4, 8, 16, 32, 64)
  l_name <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
  c_sum <- list()
  value_list <- list()
  for (i in 1:7) {
    # all combinations of i distinct day values, plus their sums
    cmb <- combn(l, i)
    c_sum <- list.append(c_sum, colSums(cmb))
    t <- unlist(apply(cmb, 2, list), recursive = FALSE)
    value_list <- list.append(value_list, t)
  }
  f_list <<- lapply(unlist(value_list, recursive = FALSE), function(e) as.character(factor(e, levels = l, labels = l_name)))
  c_list <<- unlist(c_sum)
}
decode_days <- function(d) {
  unlist(f_list[which(c_list == d)])
}
> decode_days(21)
[1] "Monday" "Wednesday" "Friday"
For comparison with hrbrmstr's function approach and a hash method:
days_dec <- rev(c("Sunday", "Saturday", "Friday", "Thursday", "Wednesday", "Tuesday", "Monday"))
decode_days_2 <- function(x) { # optimized version
  lapply(x, function(y) {
    days_dec[as.logical(intToBits(y)[1:7])]
  })
}
library(hashmap)

f_list_c <- unlist(lapply(f_list, function(e) paste(e, collapse = " ")))
H <- hashmap(c_list, f_list_c)

hash <- function(x) {
  H[[x]]
}

decode_days <- function(d) {
  f_list[which(c_list == d)]
}
microbenchmark::microbenchmark(
  lookup_list = lapply(1:100, decode_days),
  lookup_hash = lapply(1:100, hash),
  `ƒ()` = lapply(1:100, decode_days_2)
)
Unit: microseconds
expr min lq mean median uq max neval
lookup_list 136.214 146.9980 163.9146 158.0440 165.3305 336.688 100
lookup_hash 1236.040 1304.5370 1386.7976 1373.1710 1444.3965 1900.020 100
ƒ() 267.834 289.7065 353.9536 313.6065 343.5070 3594.135 100
It is surprising that the hash approach is an order of magnitude slower. I think I am probably not using the hashmap package correctly.
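For what it's worth, base R environments are hash tables too (keys must be strings), so a third lookup-style option is possible with no extra packages. A quick sketch, reusing the intToBits() decoding from above to populate it:

```r
# Sketch: precompute all 127 decodings into an environment, which base R
# backs with a hash table. Keys must be character, hence as.character().
days_vec <- c("Monday", "Tuesday", "Wednesday", "Thursday",
              "Friday", "Saturday", "Sunday")

H_env <- new.env(hash = TRUE)
for (n in 1:127) {
  assign(as.character(n), days_vec[as.logical(intToBits(n)[1:7])], envir = H_env)
}

get("21", envir = H_env)
# [1] "Monday"    "Wednesday" "Friday"
```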
Upvotes: 0