Identify several exact linear relationships within data

Question

I have some 2D data where a large number of the rows obey exactly one of a few linear relationships. It's easy to identify the lines when the data is plotted:

How can identify the slopes and intercepts of these lines?

Although which linear relationship applies is a deterministic process based on another variable, that variable has been lost. I don't care that I won't be able to predict new values; I just want all the slopes and intercepts.

If the intercepts are zero, the algorithm is relatively easy. Simply compute r = y/x for every point, round it to some precision, then identify the most frequent r. However, this won't generalize when the intercepts are nonzero.

Reproducible data:

library(data.table)
div <- function(i, d) {
  {i %% d} == 0L
}

DT <- data.table(x = runif(1e6, 1, 100e3), i = seq_len(1e6))
DT[, y := 0.8 * x + 23333]
DT[div(i, 3), y := 0.3 * x + 14444]
DT[div(i, 7), y := 1.7 * x + 8888]
DT[1:50e3, y := y + runif(.N, -20e3, 20e3)]

One process I've tried to do is to perform a cross-join, calculating the slopes between a sample of points with all other points. In this case, it does identify the slopes; however, this requires only a small minority of points to be off the lines and may be a bit inefficient.

CJ1 <- function(seq., siz = 500) {
  CJ(i1 = seq.,
     i2 = sample.int(1e6, size = siz)) %>%
    .[DT, on = "i1==i", nomatch = 0L] %>%
    .[DT[, .(x1 = x, y1 = y, i2 = i)], on = "i2", nomatch = 0L] %>%
    .[, m := round((y - y1) / (x - x1), 3)] %>%
    .[, .N, keyby = .(m)] %>%
    .[order(-N)] %>%
    .[N > (2 * N[20])]
}

Are there any established modelling techniques to extract such linear relationships?

Identify several exact linear relationships within data

Answers (1)

Related Questions