Unexpected behaviour for numpy/polars correlation given large values

Both for polars and numpy, the correlation functions seem to break down when the data's location is shifted by a very large constant.

I presume this has to do with precision issues, as e.g. a bazillion + 1 is viewed as equal to a bazillion + 2. So my question is how best to handle this. My first idea is to de-mean the data first, which will naturally slow down the code, but should at least avoid the seemingly random behaviour; a rough sketch of what I mean is below. What would be the standard approach?
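Roughly what I have in mind, as a sketch (with the df from the example below):

df.select(pl.all() - pl.all().mean()).corr()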

Reproducible example:

import polars as pl

df = pl.DataFrame({
    "a": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "b": [4.0, 3.0, 0.0, 1.0, 2.0, 0.0],
})
(df + 1123000000000000000000.0).corr()  # shift by 1.123e21

# Outputs
#shape: (2, 2)
#┌─────┬─────┐
#│ a   ┆ b   │
#│ --- ┆ --- │
#│ f64 ┆ f64 │
#╞═════╪═════╡
#│ 1.0 ┆ 1.0 │
#│ 1.0 ┆ 1.0 │
#└─────┴─────┘
(df + 112300000000000000000.0).corr()  # shift by 1.123e20

# Outputs
#shape: (2, 2)
#┌─────┬─────┐
#│ a   ┆ b   │
#│ --- ┆ --- │
#│ f64 ┆ f64 │
#╞═════╪═════╡
#│ NaN ┆ NaN │
#│ NaN ┆ NaN │
#└─────┴─────┘

(df + 11230000000000000.0).corr()  # shift by 1.123e16

# Still wrong output
#shape: (2, 2)
#┌───────────┬───────────┐
#│ a         ┆ b         │
#│ ---       ┆ ---       │
#│ f64       ┆ f64       │
#╞═══════════╪═══════════╡
#│ 1.0       ┆ -0.424264 │
#│ -0.424264 ┆ 1.0       │
#└───────────┴───────────┘

(df + 1123000000000.0).corr()  # shift by 1.123e12
# Correct output
# shape: (2, 2)
#┌───────────┬───────────┐
#│ a         ┆ b         │
#│ ---       ┆ ---       │
#│ f64       ┆ f64       │
#╞═══════════╪═══════════╡
#│ 1.0       ┆ -0.684653 │
#│ -0.684653 ┆ 1.0       │
#└───────────┴───────────┘


Upvotes: 1

Views: 33

Answers (1)

etrotta

Reputation: 363

With sufficiently large floating-point numbers, I wouldn't even call it "viewed as equal": it becomes literally the same number, as floats cannot represent the difference between them anymore. There is no way to recover the original values at that point.

For example, df + 1e20 - 1e20 will give you exactly 0.0 for every single row.
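You can see the collapse with plain Python floats, since the gap between adjacent float64 values around 1e20 is far larger than the differences in your data:

>>> (1.0 + 1e20) - 1e20
0.0
>>> (4.0 + 1e20) - 1e20
0.0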

Something similar happens with your "Still wrong output" (1.123e16) example, except here only part of the information is destroyed by the rounding:

>>> df + 1.123e16 - 1.123e16
shape: (6, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞═════╪═════╡
│ 0.0 ┆ 4.0 │
│ 2.0 ┆ 4.0 │
│ 4.0 ┆ 0.0 │
│ 0.0 ┆ 0.0 │
│ 2.0 ┆ 2.0 │
│ 4.0 ┆ 0.0 │
└─────┴─────┘
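The even numbers are no coincidence: the spacing between adjacent float64 values (the unit in the last place) around 1.123e16 is exactly 2.0, so every value gets rounded to the nearest even integer. At 1.123e20 the spacing grows to 16384, all six values collapse into a single float, the variance becomes zero, and the correlation degenerates into NaN:

>>> import math
>>> math.ulp(1.123e16)
2.0
>>> math.ulp(1.123e20)
16384.0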

The only way to preserve that difference would be to not use floats in the first place, but keep in mind that this may significantly impact your performance... That said, the corr method relies on numpy, and numpy does not support Decimal, so you'll have to de-mean using a lossless datatype first and only then cast to float:

# using the df from the question
val = pl.lit(1e21, dtype=pl.Decimal)
# keep each column's mean in Decimal as well
mean_expr = pl.all().mean().cast(pl.Decimal)
# apply the huge shift losslessly in Decimal
df = df.select(pl.all().cast(pl.Decimal) + val)
# de-mean in Decimal; the small residuals are then safe to correlate as floats
df.select(pl.all() - mean_expr).cast(float).corr()
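If you want to convince yourself the approach is sound, here is a minimal sketch of the same idea using only the standard library: shift and de-mean exactly with decimal.Decimal, then correlate the small residuals as plain floats. It recovers the -0.684653 from the well-behaved case in the question:

import math
from decimal import Decimal, getcontext

getcontext().prec = 50  # enough digits to hold 1e21 + small values exactly

shift = Decimal("1e21")
a = [Decimal(str(v)) + shift for v in [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]]
b = [Decimal(str(v)) + shift for v in [4.0, 3.0, 0.0, 1.0, 2.0, 0.0]]

# de-mean losslessly in Decimal; the residuals are tiny, so float64 holds them
ra = [float(x - sum(a) / len(a)) for x in a]
rb = [float(y - sum(b) / len(b)) for y in b]

num = sum(x * y for x, y in zip(ra, rb))
den = math.sqrt(sum(x * x for x in ra) * sum(y * y for y in rb))
print(num / den)  # ≈ -0.684653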

See also print(f'https://{.1+.2}.com'), which prints https://0.30000000000000004.com

Upvotes: 3
