Unexpected behaviour for numpy/polars correlation given large values

Both for polars and numpy, the correlation functions seem to break down when the data's location is shifted by a very large constant.

I presume this has to do with precision issues, as e.g. a bazillion + 1 is viewed as equal to a bazillion + 2. So my question is how best to handle this. My first idea is to de-mean the data first, which will naturally slow down the code, but should at least avoid the seemingly random behaviour; a rough sketch of what I mean is below. What would be the standard approach?
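Roughly what I have in mind, as a sketch (with the df from the example below):

df.select(pl.all() - pl.all().mean()).corr()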

Reproducible example:

import polars as pl

df = pl.DataFrame({
    "a": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "b": [4.0, 3.0, 0.0, 1.0, 2.0, 0.0],
})
(df + 1123000000000000000000.0).corr()  # shift by 1.123e21

# Outputs
#shape: (2, 2)
#┌─────┬─────┐
#│ a   ┆ b   │
#│ --- ┆ --- │
#│ f64 ┆ f64 │
#╞═════╪═════╡
#│ 1.0 ┆ 1.0 │
#│ 1.0 ┆ 1.0 │
#└─────┴─────┘
(df + 112300000000000000000.0).corr()  # shift by 1.123e20

# Outputs
#shape: (2, 2)
#┌─────┬─────┐
#│ a   ┆ b   │
#│ --- ┆ --- │
#│ f64 ┆ f64 │
#╞═════╪═════╡
#│ NaN ┆ NaN │
#│ NaN ┆ NaN │
#└─────┴─────┘

(df + 11230000000000000.0).corr()  # shift by 1.123e16

# Still wrong output
#shape: (2, 2)
#┌───────────┬───────────┐
#│ a         ┆ b         │
#│ ---       ┆ ---       │
#│ f64       ┆ f64       │
#╞═══════════╪═══════════╡
#│ 1.0       ┆ -0.424264 │
#│ -0.424264 ┆ 1.0       │
#└───────────┴───────────┘

(df + 1123000000000.0).corr()  # shift by 1.123e12
# Correct output
# shape: (2, 2)
#┌───────────┬───────────┐
#│ a         ┆ b         │
#│ ---       ┆ ---       │
#│ f64       ┆ f64       │
#╞═══════════╪═══════════╡
#│ 1.0       ┆ -0.684653 │
#│ -0.684653 ┆ 1.0       │
#└───────────┴───────────┘


Upvotes: 1

Views: 33

Answers (1)

etrotta

Reputation: 363

With sufficiently large floating-point numbers, I wouldn't even call it "viewed as equal": it becomes literally the same number, as floats cannot represent the difference between them anymore. There is no way to recover the original values at that point.

For example, df + 1e20 - 1e20 will give you exactly 0.0 for every single row.
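You can see the collapse with plain Python floats, since the gap between adjacent float64 values around 1e20 is far larger than the differences in your data:

>>> (1.0 + 1e20) - 1e20
0.0
>>> (4.0 + 1e20) - 1e20
0.0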

Something similar happens with your "Still wrong output" (1.123e16) example, except here only part of the information is destroyed by the rounding:

>>> df + 1.123e16 - 1.123e16
shape: (6, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞═════╪═════╡
│ 0.0 ┆ 4.0 │
│ 2.0 ┆ 4.0 │
│ 4.0 ┆ 0.0 │
│ 0.0 ┆ 0.0 │
│ 2.0 ┆ 2.0 │
│ 4.0 ┆ 0.0 │
└─────┴─────┘
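The even numbers are no coincidence: the spacing between adjacent float64 values (the unit in the last place) around 1.123e16 is exactly 2.0, so every value gets rounded to the nearest even integer. At 1.123e20 the spacing grows to 16384, all six values collapse into a single float, the variance becomes zero, and the correlation degenerates into NaN:

>>> import math
>>> math.ulp(1.123e16)
2.0
>>> math.ulp(1.123e20)
16384.0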

The only way to preserve that difference would be to not use floats in the first place, but keep in mind that this may significantly impact your performance... That said, the corr method relies on numpy, and numpy does not support Decimal, so you'll have to de-mean using a lossless datatype first and only then cast to float:

# using the df from the question
val = pl.lit(1e21, dtype=pl.Decimal)
# keep each column's mean in Decimal as well
mean_expr = pl.all().mean().cast(pl.Decimal)
# apply the huge shift losslessly in Decimal
df = df.select(pl.all().cast(pl.Decimal) + val)
# de-mean in Decimal; the small residuals are then safe to correlate as floats
df.select(pl.all() - mean_expr).cast(float).corr()
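If you want to convince yourself the approach is sound, here is a minimal sketch of the same idea using only the standard library: shift and de-mean exactly with decimal.Decimal, then correlate the small residuals as plain floats. It recovers the -0.684653 from the well-behaved case in the question:

import math
from decimal import Decimal, getcontext

getcontext().prec = 50  # enough digits to hold 1e21 + small values exactly

shift = Decimal("1e21")
a = [Decimal(str(v)) + shift for v in [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]]
b = [Decimal(str(v)) + shift for v in [4.0, 3.0, 0.0, 1.0, 2.0, 0.0]]

# de-mean losslessly in Decimal; the residuals are tiny, so float64 holds them
ra = [float(x - sum(a) / len(a)) for x in a]
rb = [float(y - sum(b) / len(b)) for y in b]

num = sum(x * y for x, y in zip(ra, rb))
den = math.sqrt(sum(x * x for x in ra) * sum(y * y for y in rb))
print(num / den)  # ≈ -0.684653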

See also print(f'https://{.1+.2}.com'), which prints https://0.30000000000000004.com

Upvotes: 3
