Reputation: 1484
I know how to simply round a column in pandas
(link); however, my problem is how to round and do a calculation at the same time in pandas.
df['age_new'] = df['age'].apply(lambda x: round(x['age'] * 0.024319744084, 0.000000000001))
TypeError: 'float' object is not subscriptable
Is there any way to do this?
Upvotes: 2
Views: 891
Reputation: 17506
The built-in round is still competitive, but we need to pass it the whole Series instead of using apply(lambda ...):
import numpy as np
import pandas as pd
# test data
np.random.seed(365)
def setup(N):
    df = pd.DataFrame({"age": np.random.randint(110, size=(N))})
    return [df]


def attribute_approach(df):
    return df.age.mul(0.024319744084).round(5)


def pandas_rounding(df):
    return (df["age"] * 0.024319744084).round(5)


def lambda_approach(df):
    return df.age.apply(lambda x: round(x * 0.024319744084, 5))


def python_rounding(df):
    return round(df["age"] * 0.024319744084, 5)


from performance_measurement import run_performance_comparison

data_size = [100, 200, 500, 1000, 2000, 10000, 20000, 50000, 100000]

approaches = [
    lambda_approach,
    attribute_approach,
    pandas_rounding,
    python_rounding,
]

run_performance_comparison(approaches, data_size, setup=setup, number_of_repetitions=20)
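As a small sanity check on the python_rounding variant above: calling the built-in round on a whole Series should give the same values as the Series.round method, since pandas objects support the built-in. The sample ages and the FACTOR name below are illustrative assumptions, not part of the benchmark.

```python
import pandas as pd

FACTOR = 0.024319744084  # multiplier taken from the question

# Small hypothetical sample; any numeric Series would do.
ages = pd.Series([18, 35, 62, 90], name="age")

# Built-in round applied once to the whole Series (no per-element lambda).
builtin_result = round(ages * FACTOR, 5)

# pandas' own vectorized rounding method.
method_result = (ages * FACTOR).round(5)

# Both routes are expected to produce the same rounded values.
same_values = builtin_result.equals(method_result)
print(same_values)
```

Either spelling avoids the per-element Python call that makes apply(lambda ...) slower.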
Profiling code (saved as performance_measurement.py, to match the import above):
import timeit
from contextlib import contextmanager
from functools import partial
from typing import Callable, Dict, List

import matplotlib.pyplot as plt


@contextmanager
def data_provider(data_size, setup=lambda N: [N], teardown=lambda *args: None):
    # setup must return a list of arguments; it is unpacked for each approach
    data = setup(data_size)
    yield data
    teardown(*data)


def run_performance_comparison(approaches: List[Callable],
                               data_size: List[int],
                               *,
                               setup=lambda N: [N],
                               teardown=lambda *N: None,
                               number_of_repetitions=5,
                               title='Performance Comparison',
                               data_name='N',
                               yscale='log',
                               xscale='log'):
    approach_times: Dict[Callable, List[float]] = {approach: [] for approach in approaches}
    for N in data_size:
        with data_provider(N, setup, teardown) as data:
            print(f'Running performance comparison for {data_name}={N}')
            for approach in approaches:
                function = partial(approach, *data)
                # keep the best (minimum) of the repeated single runs
                approach_time = min(timeit.Timer(function).repeat(repeat=number_of_repetitions, number=1))
                approach_times[approach].append(approach_time)

    for approach in approaches:
        plt.plot(data_size, approach_times[approach], label=approach.__name__)
    plt.yscale(yscale)
    plt.xscale(xscale)
    plt.xlabel(data_name)
    plt.ylabel('Execution Time (seconds)')
    plt.title(title)
    plt.legend()
    plt.show()
Upvotes: 0
Reputation: 62383
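.apply is not vectorized.
When using .apply on a pandas.Series, like 'age', the lambda variable x is a single value from the 'age' column, not a row, so the correct syntax is round(x * 0.0243, 4).
The ndigits parameter of round requires an int, not a float.
It is faster to use vectorized methods: .mul, and then .round.
With the 1000-row test data below, the vectorized versions take about 212 µs, versus about 845 µs for .apply.
import pandas as pd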
import numpy as np
# test data
np.random.seed(365)
df = pd.DataFrame({'age': np.random.randint(110, size=(1000))})
%%timeit
df.age.mul(0.024319744084).round(5)
[out]:
212 µs ± 3.86 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
(df['age'] * 0.024319744084).round(5)
[out]:
211 µs ± 9.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
df.age.apply(lambda x: round(x * 0.024319744084, 5))
[out]:
845 µs ± 20.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
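The point about ndigits can be checked directly. The sketch below uses a hypothetical three-row frame and a FACTOR constant (both assumptions for illustration): it shows that a float second argument to round raises TypeError, and then applies the vectorized multiply-and-round with an int number of decimals.

```python
import pandas as pd

FACTOR = 0.024319744084  # multiplier taken from the question

# Hypothetical sample data, not the benchmark data above.
df = pd.DataFrame({"age": [21.0, 35.5, 60.0]})

# round() needs an int for ndigits; a float such as 0.01 is rejected.
try:
    round(1.2345, 0.01)
    float_ndigits_accepted = True
except TypeError:
    float_ndigits_accepted = False

# Vectorized version: multiply the whole column, then round to 5 decimals.
df["age_new"] = df["age"].mul(FACTOR).round(5)

print(float_ndigits_accepted)
print(df)
```

If the intent behind 0.000000000001 was "12 decimal places", the int 12 would be the value to pass instead.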
Upvotes: 2
Reputation: 8352
There are two problems:
x['age'] inside the brackets doesn't need ['age'], as you already apply to the column age (that's why you get the error).
round takes an int as its second argument.
Try
df['age_new'] = df['age'].apply(lambda x: round(x * 0.024319744084, 5))
(5 is just an example.)
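To make the first problem concrete, here is a minimal sketch (with a hypothetical three-row frame and a FACTOR constant, both assumptions) contrasting the two apply flavours: Series.apply hands the lambda one scalar at a time, while DataFrame.apply with axis=1 hands it a whole row, which is the only case where indexing with ['age'] makes sense.

```python
import pandas as pd

FACTOR = 0.024319744084  # multiplier taken from the question

# Hypothetical sample data for illustration only.
df = pd.DataFrame({"age": [21.0, 35.5, 60.0]})

# Series.apply: x is a single value from the column, so use x directly.
per_value = df["age"].apply(lambda x: round(x * FACTOR, 5))

# DataFrame.apply with axis=1: row is a Series, so row["age"] is valid here.
per_row = df.apply(lambda row: round(row["age"] * FACTOR, 5), axis=1)

print(per_value.tolist())
print(per_row.tolist())
```

The row-wise form works but is usually the slowest option, so it is shown here only to explain the error, not as a recommendation.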
Upvotes: 2