Reputation: 1123
I have a lot of dates. For each date, there is a vector v (with length n) and a square matrix M (with dimension n by n). v, M and n vary by date, in terms of both values and lengths/dimensions.
The task: for each date, I want to perform a matrix operation that produces a scalar, transpose(v) * M * v.
The naïve way is to do this with a for loop, but the computation time will be huge given that it runs sequentially.
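A minimal sketch of that naive loop (the data dict holding the per-date inputs is illustrative, using the example values below):

import numpy as np

# Illustrative container: {date: (v, M)}
data = {
    "2020-01-01": ([1, 2], [[1, 2], [3, 4]]),
    "2020-02-01": ([1, 2, 3], [[1, 2, 3], [3, 4, 5], [4, 5, 6]]),
    "2020-03-01": ([1, 3, 5], [[1, 5, 9], [2, 4, 6], [3, 6, 9]]),
}

results = {}
for date, (v, M) in data.items():
    v, M = np.asarray(v), np.asarray(M)
    results[date] = v @ M @ v  # transpose(v) * M * v, a scalar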
I am wondering if I can store all the information in a single Polars DataFrame so that I can do something like df.group_by("date").agg(...), which is parallel and efficient.
To give a concrete example:
- date 2020-01-01: v = [1, 2], M = [[1, 2], [3, 4]]; the indices for v and M are ["a", "b"]
- date 2020-02-01: v = [1, 2, 3], M = [[1, 2, 3], [3, 4, 5], [4, 5, 6]]; the indices for v and M are ["a", "b", "c"]
- date 2020-03-01: v = [1, 3, 5], M = [[1, 5, 9], [2, 4, 6], [3, 6, 9]]; the indices for v and M are ["b", "d", "e"]
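For the first date, for instance, transpose(v) * M * v = [1, 2] * [[1, 2], [3, 4]] * [1, 2] = [7, 10] * [1, 2] = 27.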
.Upvotes: 0
Views: 69
Reputation: 18691
Polars doesn't have this built in, so you won't get parallelization from it, but you can store the data in a Polars DataFrame:
import polars as pl

df = pl.DataFrame(
    [
        pl.Series('date', ['2020-01-01', '2020-02-01', '2020-03-01'], dtype=pl.String),
        pl.Series('v', [[1, 2], [1, 2, 3], [1, 3, 5]], dtype=pl.List(pl.Int64)),
        pl.Series(
            'M',
            [
                [[1, 2], [3, 4]],
                [[1, 2, 3], [3, 4, 5], [4, 5, 6]],
                [[1, 5, 9], [2, 4, 6], [3, 6, 9]],
            ],
            dtype=pl.List(pl.List(pl.Int64)),
        ),
    ]
)
shape: (3, 3)
┌────────────┬───────────┬─────────────────────────────────┐
│ date ┆ v ┆ M │
│ --- ┆ --- ┆ --- │
│ str ┆ list[i64] ┆ list[list[i64]] │
╞════════════╪═══════════╪═════════════════════════════════╡
│ 2020-01-01 ┆ [1, 2] ┆ [[1, 2], [3, 4]] │
│ 2020-02-01 ┆ [1, 2, 3] ┆ [[1, 2, 3], [3, 4, 5], [4, 5, … │
│ 2020-03-01 ┆ [1, 3, 5] ┆ [[1, 5, 9], [2, 4, 6], [3, 6, … │
└────────────┴───────────┴─────────────────────────────────┘
As an aside, I have no idea what you mean when you talk about the indices being "a", "b", etc.
Anyway, you can do:
import numpy as np

def mm(s):
    # s is a Series of structs with fields "v" and "M"; each field comes back
    # with a redundant outer layer from being in a column, hence the [0]
    v = s.struct[0].to_numpy()[0]
    M = s.struct[1].to_numpy()[0]
    first = np.dot(v.transpose(), M)  # transpose(v) * M
    second = np.dot(first, v)         # (transpose(v) * M) * v, a scalar
    return pl.Series(second.reshape(1))

df.group_by("date", maintain_order=True).agg(pl.struct("v", "M").map_batches(mm))
shape: (3, 2)
┌────────────┬───────────┐
│ date ┆ v │
│ --- ┆ --- │
│ str ┆ list[i64] │
╞════════════╪═══════════╡
│ 2020-01-01 ┆ [27] │
│ 2020-02-01 ┆ [162] │
│ 2020-03-01 ┆ [523] │
└────────────┴───────────┘
Note that in the mm func, I take the first index of the resultant numpy array because there's a redundant outer layer from it being in a column.
If you don't like that the result is a list type, you can just do .explode('v'), since they're all of length 1.
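For example, building on the above:

out = (
    df.group_by("date", maintain_order=True)
    .agg(pl.struct("v", "M").map_batches(mm))
    .explode("v")  # one scalar per date instead of a one-element list
)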
Right now, if you want parallelized matrix algebra you'd have to write a plugin in Rust; otherwise you're pinned down by the Python GIL.
Upvotes: 1
Reputation: 268
Polars only has support for relatively simple arithmetic operations on lists as of now; you are generally better off using libraries like NumPy or JAX for complex operations on multi-dimensional / deeply nested data.
You could use something like df.select("v", "M").map_rows(lambda row: np.array(row[0]).T @ row[1] @ row[0], return_dtype=pl.Int32()) to still benefit a little from Polars (selecting only "v" and "M" first so that row[0] and row[1] are the vector and the matrix), but it is not going to be extremely efficient.
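A minimal end-to-end sketch using the df from the first answer (the "result" column name is an assumption; map_rows otherwise picks a generic name):

import numpy as np
import polars as pl

# Apply the quadratic form row by row; each row comes through as a (v, M) tuple
scalars = (
    df.select("v", "M")
    .map_rows(lambda row: int(np.asarray(row[0]) @ np.asarray(row[1]) @ np.asarray(row[0])))
    .to_series()
    .rename("result")
)

out = df.select("date").with_columns(scalars)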
Upvotes: 1