Hugh
Hugh

Reputation: 16090

Identify several exact linear relationships within data

I have some 2D data where a large number of the rows obey exactly one of a few linear relationships. It's easy to identify the lines when the data is plotted:enter image description here

How can identify the slopes and intercepts of these lines?

Although which linear relationship applies is a deterministic process based on another variable, that variable has been lost. I don't care that I won't be able to predict new values; I just want all the slopes and intercepts.

If the intercepts are zero, the algorithm is relatively easy. Simply compute r = y/x for every point, round it to some precision, then identify the most frequent r. However, this won't generalize when the intercepts are nonzero.

Reproducible data:

library(data.table)
div <- function(i, d) {
  {i %% d} == 0L
}

DT <- data.table(x = runif(1e6, 1, 100e3), i = seq_len(1e6))
DT[, y := 0.8 * x + 23333]
DT[div(i, 3), y := 0.3 * x + 14444]
DT[div(i, 7), y := 1.7 * x + 8888]
DT[1:50e3, y := y + runif(.N, -20e3, 20e3)]

One process I've tried to do is to perform a cross-join, calculating the slopes between a sample of points with all other points. In this case, it does identify the slopes; however, this requires only a small minority of points to be off the lines and may be a bit inefficient.

CJ1 <- function(seq., siz = 500) {
  CJ(i1 = seq.,
     i2 = sample.int(1e6, size = siz)) %>%
    .[DT, on = "i1==i", nomatch = 0L] %>%
    .[DT[, .(x1 = x, y1 = y, i2 = i)], on = "i2", nomatch = 0L] %>%
    .[, m := round((y - y1) / (x - x1), 3)] %>%
    .[, .N, keyby = .(m)] %>%
    .[order(-N)] %>%
    .[N > (2 * N[20])]
}

Are there any established modelling techniques to extract such linear relationships?

Upvotes: 3

Views: 143

Answers (1)

www
www

Reputation: 39154

I reduced the number of rows of your example data frame to 1000 as below because this is more similar to the plot example you provided.

library(data.table)

div <- function(i, d) {
  {i %% d} == 0L
}

set.seed(1)

DT <- data.table(x = runif(1e3, 1, 100e3), i = seq_len(1e3))
DT[, y := 0.8 * x + 23333]
DT[div(i, 3), y := 0.3 * x + 14444]
DT[div(i, 7), y := 1.7 * x + 8888]
DT[1:(1e2), y := y + runif(.N, -20e3, 20e3)]

library(ggplot2)

# See the scatter plot of x and y
ggplot(DT, aes(x = x, y = y)) +
  geom_point()

enter image description here

As you mentioned, if the line is through the origin. the slope can be calculated as y/x. We can thus loop through a series of possible intercept values, such as 1:30000, calculate the y minus the potential intercept, and then calculate the slope for all rows. If the true intercept shows, there will be one slope becomes abundant in the dataset. We can use a function to find the mode, thus we can determine the intercepts and slopes. Below is a code example.

# A function to find the mode
# From https://stackoverflow.com/a/8189441/7669809
Modes <- function(x) {
  ux <- unique(x)
  tab <- tabulate(match(x, ux))
  ux[tab == max(tab)]
}

# A for loop to loop through a series of intercept and determine the number of mode
answer <- numeric()

for (i in 1:30000){
  y_temp <- DT$y - i
  r <- y_temp/DT$x
  M <- Modes(r)
  if (length(M) == 1){
    answer[[as.character(i)]] <- M
  }
}

The final output is a vector with slopes as the elements and intercepts as the associated names.

answer
# 8888 14444 23333 
#  1.7   0.3   0.8

Upvotes: 1

Related Questions