Swanny
Swanny

Reputation: 189

Find interval over similar column names in R

I'd like a simpler way to find whether a value is within any of the given intervals. Given the dataframe:

Value  start1  start2  start3  end1  end2  end3
212    82      195     409     97    220   411
80     57      95      111     69    100   130

I'd like to create a new column with a 1 if the 'Value' is in any of the intervals ([start1-end1], [start2-end2], etc) and a 0 if it is not; so in the case above, the first row would have a value of 1 because 212 falls in the second interval and the second row would be a 0. Note that these are the orders of the columns currently and edge cases (value matching the start or end of the interval) should be encoded as a 1.

I could do this with ifelse statements but there are 260 columns and I feel like the solution could be useful in other ways in the future.

Upvotes: 0

Views: 363

Answers (2)

www
www

Reputation: 39154

A solution using tidyverse. The final output is in the InRange column of the dt3 data frame.

# Create example data frame
dt <- read.table(text = "Value  start1  start2  start3  end1  end2  end3
212    82      195     409     97    220   411
                 80     57      95      111     69    100   130",
                 header = TRUE, stringsAsFactors = FALSE)

# Load packages
library(tidyverse)

# Process the data
dt2 <- dt %>% mutate(GroupID = 1:n()) 

dt3 <- dt2 %>%
  gather(StartEnd, Number, -Value, -GroupID) %>%
  mutate(Type = gsub("[0-9]", "", StartEnd),
         ID = gsub("[a-z]", "", StartEnd)) %>%
  select(-StartEnd) %>%
  spread(Type, Number) %>%
  mutate(InRange = ifelse(Value >= start & Value <= end, 1, 0)) %>%
  group_by(GroupID) %>%
  summarise(InRange = max(InRange)) %>%
  right_join(dt2, by = "GroupID")

Notice that I created the GroupID column to make sure duplicated values in the Value column, if any, will not affect the analysis. However, if you are certain that there are no duplicated values, you can use the following code, without creating the GroupID column. dt2 is the final output.

# Process the data
dt2 <- dt %>%
  gather(StartEnd, Number, -Value) %>%
  mutate(Type = gsub("[0-9]", "", StartEnd),
         ID = gsub("[a-z]", "", StartEnd)) %>%
  select(-StartEnd) %>%
  spread(Type, Number) %>%
  mutate(InRange = ifelse(Value >= start & Value <= end, 1, 0)) %>%
  group_by(Value) %>%
  summarise(InRange = max(InRange)) %>%
  right_join(dt, by = "Value") 

Upvotes: 0

sirallen
sirallen

Reputation: 1966

With the data.table package:

library(data.table)

dt = data.table(Value=c(212,80), start1=c(82,57), start2=c(195,95), start3=c(409,111),
                end1=c(97,69), end2=c(220,100), end3=c(411,130))

dt[, rowid:= .I]

It's more natural to do this with a 'melted' table:

dt_melt = melt(dt, id=c('rowid','Value'), measure=patterns('start','end'),
               variable.name='interval', value.name=c('start','end'))

#    rowid Value interval start end
# 1:     1   212        1    82  97
# 2:     2    80        1    57  69
# 3:     1   212        2   195 220
# 4:     2    80        2    95 100
# 5:     1   212        3   409 411
# 6:     2    80        3   111 130

Now we can do the computation and merge with dt on the rowid:

dt[dt_melt[, as.integer(any(between(Value, start, end))), by='rowid'], on='rowid']

#    Value start1 start2 start3 end1 end2 end3 rowid V1
# 1:   212     82    195    409   97  220  411     1  1
# 2:    80     57     95    111   69  100  130     2  0

Upvotes: 1

Related Questions