Andi

Reputation: 4899

Extending the Polars API for both DataFrame and LazyFrame

I am extending the polars DataFrame and LazyFrame as described in the docs.

Let's go with their split example for pl.DataFrame. Let's say I also wanted to extend the pl.LazyFrame with the same split function.

The code would look pretty much the same, with the exception of the decorator (@pl.api.register_dataframe_namespace("split") vs. @pl.api.register_lazyframe_namespace("split")), the input argument (df vs. ldf) and the return type (list[pl.DataFrame] vs. list[pl.LazyFrame]).

This looks like a clear violation of the DRY principle.

What is the best practice for extending the API on multiple fronts (DataFrame, LazyFrame, Series)?

To put it differently, how can I apply an extension to both a pl.DataFrame and a pl.LazyFrame? And can this extension share the same namespace?

Upvotes: 2

Views: 251

Answers (1)

Dean MacGregor

Reputation: 18691

The decorator syntax is just a convenient shorthand for fun = decorate(fun), so you can write one class that accepts either frame type and register it twice, like this:

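import polars as pl


class SplitFrame:
    def __init__(self, df: pl.DataFrame | pl.LazyFrame):
        # Work lazily internally, but remember the input type so the
        # results can be handed back as the same kind of frame.
        if isinstance(df, pl.DataFrame):
            self._df = df.lazy()
            self._was_df = True
        else:
            self._df = df
            self._was_df = False

    def by_alternate_rows(self) -> list[pl.DataFrame | pl.LazyFrame]:
        # Add a temporary row index "n", then split on its parity.
        df = self._df.with_row_index(name="n")
        pre_return = [
            df.filter((pl.col("n") % 2) == 0).drop("n"),
            df.filter((pl.col("n") % 2) != 0).drop("n"),
        ]
        if self._was_df:
            # Eager input: collect both lazy queries together.
            return pl.collect_all(pre_return)
        else:
            return pre_return


# Register the same class as the "split" namespace on both frame types.
pl.api.register_dataframe_namespace("split")(SplitFrame)
pl.api.register_lazyframe_namespace("split")(SplitFrame)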

Note that each of those register functions actually returns a decorator rather than being a decorator itself. With the usual @ syntax you don't notice this, but when you call them directly you need a second pair of parentheses, which looks odd.
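To make the double parentheses less mysterious, here is a minimal pure-Python sketch of the same pattern. Everything in it (Frame, LazyFrame, register_namespace, Split) is a hypothetical stand-in I made up for illustration; it is not how Polars implements its register functions, only the general shape of a function that returns a decorator.

```python
# Toy stand-ins for pl.DataFrame and pl.LazyFrame (hypothetical, not Polars).
class Frame:
    pass


class LazyFrame:
    pass


def register_namespace(target, name):
    """Return a decorator that attaches ns_class to target under name."""
    def decorator(ns_class):
        # Accessing target_instance.<name> wraps the instance in ns_class.
        setattr(target, name, property(lambda self: ns_class(self)))
        # Hand the class back unchanged so it can be registered again.
        return ns_class
    return decorator


class Split:
    def __init__(self, frame):
        self._frame = frame

    def kind(self):
        # Report which type this namespace was reached through.
        return type(self._frame).__name__


# First call builds the decorator, second call applies it to the class.
register_namespace(Frame, "split")(Split)
register_namespace(LazyFrame, "split")(Split)

print(Frame().split.kind())      # Frame
print(LazyFrame().split.kind())  # LazyFrame
```

Because the inner decorator in this sketch returns the class unchanged, the same class can be registered on as many targets as you like.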

Now you can do

df = pl.DataFrame({'a': [1, 2, 3, 4]})
df.split.by_alternate_rows()
[shape: (2, 1)
 ┌─────┐
 │ a   │
 │ --- │
 │ i64 │
 ╞═════╡
 │ 1   │
 │ 3   │
 └─────┘,
 shape: (2, 1)
 ┌─────┐
 │ a   │
 │ --- │
 │ i64 │
 ╞═════╡
 │ 2   │
 │ 4   │
 └─────┘]

or

ldf = pl.LazyFrame({'a': [1, 2, 3, 4]})
ldf.split.by_alternate_rows()
[<LazyFrame at 0x7F697DFD67B0>, <LazyFrame at 0x7F697DFD61B0>]

Upvotes: 2
