Reputation: 16090
I have some 2D data where a large number of the rows obey exactly one of a few linear relationships. It's easy to identify the lines when the data is plotted:
How can identify the slopes and intercepts of these lines?
Although which linear relationship applies is a deterministic process based on another variable, that variable has been lost. I don't care that I won't be able to predict new values; I just want all the slopes and intercepts.
If the intercepts are zero, the algorithm is relatively easy. Simply compute r = y/x
for every point, round it to some precision, then identify the most frequent r
. However, this won't generalize when the intercepts are nonzero.
Reproducible data:
library(data.table)
div <- function(i, d) {
{i %% d} == 0L
}
DT <- data.table(x = runif(1e6, 1, 100e3), i = seq_len(1e6))
DT[, y := 0.8 * x + 23333]
DT[div(i, 3), y := 0.3 * x + 14444]
DT[div(i, 7), y := 1.7 * x + 8888]
DT[1:50e3, y := y + runif(.N, -20e3, 20e3)]
One process I've tried to do is to perform a cross-join, calculating the slopes between a sample of points with all other points. In this case, it does identify the slopes; however, this requires only a small minority of points to be off the lines and may be a bit inefficient.
CJ1 <- function(seq., siz = 500) {
CJ(i1 = seq.,
i2 = sample.int(1e6, size = siz)) %>%
.[DT, on = "i1==i", nomatch = 0L] %>%
.[DT[, .(x1 = x, y1 = y, i2 = i)], on = "i2", nomatch = 0L] %>%
.[, m := round((y - y1) / (x - x1), 3)] %>%
.[, .N, keyby = .(m)] %>%
.[order(-N)] %>%
.[N > (2 * N[20])]
}
Are there any established modelling techniques to extract such linear relationships?
Upvotes: 3
Views: 143
Reputation: 39154
I reduced the number of rows of your example data frame to 1000 as below because this is more similar to the plot example you provided.
library(data.table)
div <- function(i, d) {
{i %% d} == 0L
}
set.seed(1)
DT <- data.table(x = runif(1e3, 1, 100e3), i = seq_len(1e3))
DT[, y := 0.8 * x + 23333]
DT[div(i, 3), y := 0.3 * x + 14444]
DT[div(i, 7), y := 1.7 * x + 8888]
DT[1:(1e2), y := y + runif(.N, -20e3, 20e3)]
library(ggplot2)
# See the scatter plot of x and y
ggplot(DT, aes(x = x, y = y)) +
geom_point()
As you mentioned, if the line is through the origin. the slope can be calculated as y/x
. We can thus loop through a series of possible intercept values, such as 1:30000
, calculate the y
minus the potential intercept, and then calculate the slope for all rows. If the true intercept shows, there will be one slope becomes abundant in the dataset. We can use a function to find the mode, thus we can determine the intercepts and slopes. Below is a code example.
# A function to find the mode
# From https://stackoverflow.com/a/8189441/7669809
Modes <- function(x) {
ux <- unique(x)
tab <- tabulate(match(x, ux))
ux[tab == max(tab)]
}
# A for loop to loop through a series of intercept and determine the number of mode
answer <- numeric()
for (i in 1:30000){
y_temp <- DT$y - i
r <- y_temp/DT$x
M <- Modes(r)
if (length(M) == 1){
answer[[as.character(i)]] <- M
}
}
The final output is a vector with slopes as the elements and intercepts as the associated names.
answer
# 8888 14444 23333
# 1.7 0.3 0.8
Upvotes: 1