Reputation: 45
I have the following data.table in R:
dataset <- data.table(C=c("a", "b", "c") , neg=c("5, 7", "9", "3, 4, 5"), pos = c("5.05, 8", "", "2.95, 4.2"))
the table looks like this:
I want to find an overlap between the values in the columns “neg” and “pos”. If the difference between any value in the two columns in the same row is smaller than 0.1 - I want to merge the values by taking the mean of the two values. E.g. for the pair 5 and 5.05 – 5.025 should be calculated. If there is no value within the same 0.1 range just the original value is shown. I added a picture of my idea of a possible result:
Is there a function to do this directly or do I have to split/ rearrange the table before?
Thanks for your help!
Upvotes: 0
Views: 117
Reputation: 269491
In the example in the question there are no values within 0.1 of each other within any input cell and we assume that that is the general case or if not then it is ok to coalesce those too.
For each row scan the numbers into a numeric vector, sort it and find the instances where adjacent numbers are less than 0.1 . Take the mean of those numbers and NA out the prior number. Omit the NA's and convert to a comma separated string.
If the values in C are unique then we could replace the by= with by=C.
dataset[, overlap := {
s <- sort(scan(text = c(neg, pos), sep = ",", quiet = TRUE))
wx <- which(c(FALSE, diff(s) < 0.1))
s[wx] <- (s[wx] + s[wx-1]) / 2
s[wx-1] <- NA
toString(na.omit(s))
}, by = 1:nrow(dataset)]
dataset
## C neg pos overlap
## 1: a 5, 7 5.05, 8 5.025, 7, 8
## 2: b 9 9
## 3: c 3, 4, 5 2.95, 4.2 2.975, 4, 4.2, 5
Upvotes: 1
Reputation: 388907
The numbers are stored as character values so first you need to split them on comma, convert them to numeric, sort
the data. You can then calculate the difference between consecutive values and combine the two values (by taking their average) if their difference is less than 0.1.
In base R, with Map
and tapply
you can do -
dataset$overlap <- Map(function(x, y) {
p <- sort(as.numeric(c(x, y)))
as.numeric(tapply(p, cumsum(c(TRUE, diff(p) > 0.1)), mean))
}, strsplit(dataset$neg, ',\\s*'), strsplit(dataset$pos, ',\\s*'))
dataset
# C neg pos overlap
#1: a 5, 7 5.05, 8 5.025,7.000,8.000
#2: b 9 9
#3: c 3, 4, 5 2.95, 4.2 2.975,4.000,4.200,5.000
dataset$overlap
#[[1]]
#[1] 5.025 7.000 8.000
#[[2]]
#[1] 9
#[[3]]
#[1] 2.975 4.000 4.200 5.000
Upvotes: 1