lebesgue

Reputation: 1123

How can a matrix be stored efficiently in a Polars DataFrame, and matrix operations performed on it?

I have a lot of dates. For each date, there is a vector v (of length n) and a square matrix M (of dimension n by n). v, M and n vary by date, in both values and dimensions.

The task is, for each date, to perform a matrix operation that yields a single scalar: transpose(v) * M * v.

The naive way is to do this in a for loop, but the computation time will be huge since it runs sequentially.

I am wondering if I can store all the information in a single Polars DataFrame so that I can do something like df.group_by("date").agg(...), which is parallel and efficient.
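For reference, the sequential baseline is just one quadratic form per date. A minimal NumPy sketch (the dates and values here are illustrative, not from the original post):

import numpy as np

# Hypothetical per-date data: each entry is (v, M) with matching dimensions
data = {
    "2020-01-01": (np.array([1, 2]), np.array([[1, 2], [3, 4]])),
    "2020-02-01": (np.array([1, 2, 3]), np.array([[1, 2, 3], [3, 4, 5], [4, 5, 6]])),
}

# The naive loop: compute v.T @ M @ v for each date, one at a time
results = {date: int(v @ M @ v) for date, (v, M) in data.items()}
print(results)  # {'2020-01-01': 27, '2020-02-01': 162}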

To give a concrete example:

[example data shown as an image in the original post]

Upvotes: 0

Views: 69

Answers (2)

Dean MacGregor

Reputation: 18691

Polars doesn't have this built in, so you won't get parallelization from it, but you can store the data in a Polars DataFrame:

import polars as pl

df = pl.DataFrame(
    [
        pl.Series("date", ["2020-01-01", "2020-02-01", "2020-03-01"], dtype=pl.String),
        pl.Series("v", [[1, 2], [1, 2, 3], [1, 3, 5]], dtype=pl.List(pl.Int64)),
        pl.Series(
            "M",
            [
                [[1, 2], [3, 4]],
                [[1, 2, 3], [3, 4, 5], [4, 5, 6]],
                [[1, 5, 9], [2, 4, 6], [3, 6, 9]],
            ],
            dtype=pl.List(pl.List(pl.Int64)),
        ),
    ]
)

shape: (3, 3)
┌────────────┬───────────┬─────────────────────────────────┐
│ date       ┆ v         ┆ M                               │
│ ---        ┆ ---       ┆ ---                             │
│ str        ┆ list[i64] ┆ list[list[i64]]                 │
╞════════════╪═══════════╪═════════════════════════════════╡
│ 2020-01-01 ┆ [1, 2]    ┆ [[1, 2], [3, 4]]                │
│ 2020-02-01 ┆ [1, 2, 3] ┆ [[1, 2, 3], [3, 4, 5], [4, 5, … │
│ 2020-03-01 ┆ [1, 3, 5] ┆ [[1, 5, 9], [2, 4, 6], [3, 6, … │
└────────────┴───────────┴─────────────────────────────────┘

As an aside, I have no idea what you mean when you talk about the indices being "a", "b" etc.

Anyway, you can do:

import numpy as np

def mm(s):
    # s is a length-1 struct Series per group; fields 0 and 1 are v and M
    v = s.struct[0].to_numpy()[0]
    M = s.struct[1].to_numpy()[0]
    first = np.dot(v.transpose(), M)
    second = np.dot(first, v)
    return pl.Series(second.reshape(1))

df.group_by("date", maintain_order=True).agg(pl.struct("v", "M").map_batches(mm))
shape: (3, 2)
┌────────────┬───────────┐
│ date       ┆ v         │
│ ---        ┆ ---       │
│ str        ┆ list[i64] │
╞════════════╪═══════════╡
│ 2020-01-01 ┆ [27]      │
│ 2020-02-01 ┆ [162]     │
│ 2020-03-01 ┆ [523]     │
└────────────┴───────────┘

Note that in the mm function, I take the first index of the resulting NumPy array because there's a redundant outer layer from it being in a column.

If you don't like that the result is a list type, you can just do .explode('v'), since the lists are all of length 1.
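A minimal illustration of that explode step, using a hand-built stand-in for the aggregated result (the values are the ones from the output above):

import polars as pl

# Hypothetical aggregated result: each group produced a length-1 list
res = pl.DataFrame(
    {"date": ["2020-01-01", "2020-02-01", "2020-03-01"], "v": [[27], [162], [523]]}
)

# explode unwraps each length-1 list into a plain scalar column
flat = res.explode("v")
print(flat)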

Right now, if you want parallelized matrix algebra you'd have to write a plugin in Rust; otherwise you're pinned down by the Python GIL.

Upvotes: 1

etrotta

Reputation: 268

Polars only supports relatively simple arithmetic operations on lists as of now; you are generally better off using libraries like NumPy or JAX for complex operations on multi-dimensional / deeply nested data.

You could use something like .map_rows(lambda row: np.array(row[0]).T @ row[1] @ row[0], return_dtype=pl.Int32()) to still benefit a little from Polars, but it is not going to be extremely efficient.
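A runnable sketch of that map_rows approach, using the same example data as the other answer (note map_rows passes each row as a plain tuple, so the v and M columns must be selected in that order; the extra np.array wrapping is my addition for safety):

import numpy as np
import polars as pl

df = pl.DataFrame(
    {
        "v": [[1, 2], [1, 2, 3], [1, 3, 5]],
        "M": [
            [[1, 2], [3, 4]],
            [[1, 2, 3], [3, 4, 5], [4, 5, 6]],
            [[1, 5, 9], [2, 4, 6], [3, 6, 9]],
        ],
    }
)

# Each row arrives as (v, M); compute the quadratic form v.T @ M @ v per row
out = df.select("v", "M").map_rows(
    lambda row: int(np.array(row[0]) @ np.array(row[1]) @ np.array(row[0])),
    return_dtype=pl.Int64,
)
print(out)  # one column holding 27, 162, 523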

Upvotes: 1

Related Questions