Demetri Pananos

Reputation: 7404

How can I summarize all columns of a polars dataframe

Pandas makes it easy to summarize columns of a dataframe with an arbitrary function using df.apply(my_func, axis=0).

How can I do the same in polars? Below is an MWE. I have a function (just an example; I would like to do this for arbitrary functions) that I can apply to entire columns. In pandas, it summarizes each column using the syntax shown above.

What is the syntax to perform the same operation in polars?

import polars as pl
import pandas as pd
import numpy as np

# Toy Data
data = {'a':[1, 2, 3, 4, 5], 
        'b': [2, 4, 6, 8, 10]}

# Pandas and polars copy
df = pd.DataFrame(data)
pdf = pl.DataFrame(data)

# Function I want to use to summarize my columns
my_func = lambda x: np.log(x.mean())

# How to do this in pandas
df.apply(my_func, axis=0)

# How do I do the same in polars?

Upvotes: 1

Views: 679

Answers (2)

ritchie46

Reputation: 14690

You really shouldn't use Python functions when there are expressions in polars that can achieve your goal.

data = {'a':[1, 2, 3, 4, 5], 
        'b': [2, 4, 6, 8, 10]}

df = pl.DataFrame(data)

df.select(
    pl.all().mean().log()
)

Every use of map_batches or map_elements is a code smell and should be avoided unless the computation cannot be expressed any other way.

Context

The idiomatic way to compute anything in polars is using expressions. They should be preferred for a number of reasons:

  • they run in parallel
  • they can be optimized
  • they are implemented in compiled Rust

A Python function is opaque to polars. It cannot be optimized because we don't know what it does or what it returns.

The OP says they want to run an arbitrary function. Expressions cover this case too: any expression can call map_batches or map_elements with a Python function as an escape hatch. For this reason, answering how to run an expression on all columns is a superset of answering how to run a Python function on all columns.

Upvotes: 2

Wayoshi

Reputation: 2893

You can use map_batches:

pdf.select(pl.all().map_batches(my_func))

See the User-defined functions section in the User guide for more details.

Upvotes: 2
