Peter Chen

Reputation: 1484

How to round calculations with pandas

I know how to round a column in pandas (link). My problem is how to do a calculation and round the result in the same step.

df['age_new'] = df['age'].apply(lambda x: round(x['age'] * 0.024319744084, 0.000000000001))

TypeError: 'float' object is not subscriptable

Is there any way to do this?

Upvotes: 2

Views: 891

Answers (3)

Sebastian Wozny

Reputation: 17506

The built-in round is still competitive, but we need to call it on the whole Series instead of calling it per element with apply(lambda ...):

import numpy as np
import pandas as pd

# test data
np.random.seed(365)


def setup(N):
    df = pd.DataFrame({"age": np.random.randint(110, size=(N))})
    return [df]


def attribute_approach(df):
    return df.age.mul(0.024319744084).round(5)


def pandas_rounding(df):
    return (df["age"] * 0.024319744084).round(5)


def lambda_approach(df):
    return df.age.apply(lambda x: round(x * 0.024319744084, 5))


def python_rounding(df):
    return round(df["age"] * 0.024319744084, 5)


from performance_measurement import run_performance_comparison

data_size = [100, 200, 500, 1000, 2000, 10000, 20000, 50000, 100000]

approaches = [
    lambda_approach,
    attribute_approach,
    pandas_rounding,
    python_rounding,
]
run_performance_comparison(approaches, data_size, setup=setup, number_of_repetitions=20)


Profiling code:

import timeit
from contextlib import contextmanager
from functools import partial
from typing import Callable, Dict, List

import matplotlib.pyplot as plt


@contextmanager
def data_provider(data_size, setup=lambda N: [N], teardown=lambda *N: None):
    data = setup(data_size)
    yield data
    teardown(*data)


def run_performance_comparison(approaches: List[Callable],
                               data_size: List[int],
                               *,
                               setup=lambda N: [N],
                               teardown=lambda *N: None,
                               number_of_repetitions=5,
                               title='Performance Comparison',
                               data_name='N',
                               yscale='log',
                               xscale='log'):
    approach_times: Dict[Callable, List[float]] = {approach: [] for approach in approaches}
    for N in data_size:
        with data_provider(N, setup, teardown) as data:
            print(f'Running performance comparison for {data_name}={N}')
            for approach in approaches:
                function = partial(approach, *data)
                approach_time = min(timeit.Timer(function).repeat(repeat=number_of_repetitions, number=1))
                approach_times[approach].append(approach_time)

    for approach in approaches:
        plt.plot(data_size, approach_times[approach], label=approach.__name__)
    plt.yscale(yscale)
    plt.xscale(xscale)

    plt.xlabel(data_name)
    plt.ylabel('Execution Time (seconds)')
    plt.title(title)
    plt.legend()
    plt.show()

Upvotes: 0

Trenton McKinney

Reputation: 62383

  • .apply is not vectorized.
    • When .apply is used on a pandas.Series such as 'age', the lambda variable x is each individual value of the column, not the column itself, so the correct syntax is round(x * 0.0243, 4).
    • The ndigits parameter of round requires an int, not a float.
  • It is faster to use vectorized methods, such as .mul followed by .round.
    • In this case, with 1000 rows, the vectorized method is about 4 times faster than .apply.

import pandas as pd
import numpy as np

# test data
np.random.seed(365)
df = pd.DataFrame({'age': np.random.randint(110, size=(1000))})

%%timeit
df.age.mul(0.024319744084).round(5)
[out]:
212 µs ± 3.86 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
(df['age'] * 0.024319744084).round(5)
[out]:
211 µs ± 9.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
df.age.apply(lambda x: round(x * 0.024319744084, 5))
[out]:
845 µs ± 20.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
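
As a rough check that the vectorized and .apply versions agree, here is a minimal sketch. The four sample ages and the names factor, vectorized and elementwise are made up for illustration; only the multiplier and the 5 decimal places come from the code above.

import numpy as np
import pandas as pd

# Hypothetical sample data, not the 1000-row frame timed above
df = pd.DataFrame({"age": [0, 1, 42, 109]})
factor = 0.024319744084

# Vectorized: multiply the whole column, then round to 5 decimals
vectorized = df["age"].mul(factor).round(5)

# Element-wise: the lambda receives one value at a time
elementwise = df["age"].apply(lambda x: round(x * factor, 5))

# Store the result in the new column from the question
df["age_new"] = vectorized

# Both approaches should give matching values for this sample
same = np.allclose(vectorized, elementwise)
print(df)
print(same)

For this sample the two results match, so the choice between them should come down to speed and readability.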

Upvotes: 2

fuenfundachtzig

Reputation: 8352

There are two problems:

  • x['age'] inside the lambda should be just x: apply is already called on the column age, so x is a single value, not a row (that's why you get the error)
  • round takes an int as its second argument (the number of decimal places), not a float.

Try

df['age_new'] = df['age'].apply(lambda x: round(x * 0.024319744084, 5))

(5 is just an example.)
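
Both problems can be reproduced without pandas. This is only a sketch: the value 42.0 stands in for a single cell of the age column, and the variable names are made up for illustration.

# A single cell value, as Series.apply would pass it to the lambda (assumed sample)
value = 42.0

# Problem 1: indexing a float, which is what x['age'] does inside the lambda
try:
    value['age']
except TypeError as exc:
    subscript_error = str(exc)

# Problem 2: passing a float as the second argument of round
try:
    round(value * 0.024319744084, 0.000000000001)
except TypeError as exc:
    ndigits_error = str(exc)

# Corrected call: the second argument is an int number of decimal places
rounded = round(value * 0.024319744084, 5)

print(subscript_error)
print(ndigits_error)
print(rounded)

The first message is the one from the question. The second only appears once the first is fixed, which is why the original traceback does not mention it.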

Upvotes: 2
