xyz
xyz

Reputation: 45

Calculation on a vector of values in a R data.table

I have the following data.table in R:

dataset <- data.table(C=c("a", "b", "c") , neg=c("5, 7", "9", "3, 4, 5"), pos = c("5.05, 8", "", "2.95, 4.2"))

the table looks like this:

example data.table

I want to find an overlap between the values in the columns “neg” and “pos”. If the difference between any value in the two columns in the same row is smaller than 0.1 - I want to merge the values by taking the mean of the two values. E.g. for the pair 5 and 5.05 – 5.025 should be calculated. If there is no value within the same 0.1 range just the original value is shown. I added a picture of my idea of a possible result:

possible result table

Is there a function to do this directly or do I have to split/ rearrange the table before?

Thanks for your help!

Upvotes: 0

Views: 117

Answers (2)

G. Grothendieck
G. Grothendieck

Reputation: 269491

In the example in the question there are no values within 0.1 of each other within any input cell and we assume that that is the general case or if not then it is ok to coalesce those too.

For each row scan the numbers into a numeric vector, sort it and find the instances where adjacent numbers are less than 0.1 . Take the mean of those numbers and NA out the prior number. Omit the NA's and convert to a comma separated string.

If the values in C are unique then we could replace the by= with by=C.

dataset[, overlap := {
  s <- sort(scan(text = c(neg, pos), sep = ",", quiet = TRUE))
  wx <- which(c(FALSE, diff(s) < 0.1))
  s[wx] <- (s[wx] + s[wx-1]) / 2
  s[wx-1] <- NA
  toString(na.omit(s))
}, by = 1:nrow(dataset)]

dataset
##    C     neg       pos          overlap
## 1: a    5, 7   5.05, 8      5.025, 7, 8
## 2: b       9                          9
## 3: c 3, 4, 5 2.95, 4.2 2.975, 4, 4.2, 5

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 388907

The numbers are stored as character values so first you need to split them on comma, convert them to numeric, sort the data. You can then calculate the difference between consecutive values and combine the two values (by taking their average) if their difference is less than 0.1.

In base R, with Map and tapply you can do -

dataset$overlap <- Map(function(x, y) {
  p <- sort(as.numeric(c(x, y)))
  as.numeric(tapply(p, cumsum(c(TRUE, diff(p) > 0.1)), mean))
}, strsplit(dataset$neg, ',\\s*'), strsplit(dataset$pos, ',\\s*'))

dataset

#   C     neg       pos                 overlap
#1: a    5, 7   5.05, 8       5.025,7.000,8.000
#2: b       9                                 9
#3: c 3, 4, 5 2.95, 4.2 2.975,4.000,4.200,5.000

dataset$overlap

#[[1]]
#[1] 5.025 7.000 8.000

#[[2]]
#[1] 9

#[[3]]
#[1] 2.975 4.000 4.200 5.000

Upvotes: 1

Related Questions