Reputation: 5965
Let's say I have the following pandas DataFrame:
df = pd.DataFrame({'name': ['Johnny', 'Brad'], 'rating': [1.0, 0.9]})
I want to convert the rating column from a decimal to a percentage as a string (e.g. 1.0 to '100%'). The following works okay:
def decimal_to_percent_string(row):
    return '{}%'.format(row['rating'] * 100)
df['rating'] = df.apply(func=decimal_to_percent_string, axis=1)
This seems very inefficient to me because it calls the function once per row across the entire DataFrame, and my DataFrame is very large. Is there a better way to do this?
Upvotes: 4
Views: 21021
Reputation: 14656
If you just want the DataFrame to display that column as a %, it's better to use a formatter, since then the rating column isn't actually changed and you can perform further operations on it.
df.style.format({'rating': '{:.2%}'.format})
This returns a Styler rather than modifying df; when rendered (for example in a Jupyter notebook) it will show:
     name   rating
0  Johnny  100.00%
1    Brad   90.00%
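If you want the same display-only effect from a plain print, one option is the formatters argument of DataFrame.to_string. A minimal sketch, assuming the question's two-row frame (the variable name text is only for illustration):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Johnny', 'Brad'], 'rating': [1.0, 0.9]})

# Format only the rendered text; the underlying column is left untouched.
text = df.to_string(formatters={'rating': '{:.2%}'.format})
print(text)

# The column is still numeric, so further arithmetic keeps working.
print(df['rating'].mean())
```

Because nothing is assigned back to df, the rating column keeps its float dtype.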
If you actually need to convert the field to a string (e.g. for ETL purposes), this command is more idiomatic and, in the benchmark below, faster than the mul/astype/add chain on both small and large DataFrames:
df['rating'] = df['rating'].apply('{:.2%}'.format)
Now the rating column is a string and it displays identically to the above result.
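To make the effect of the '{:.2%}' spec concrete, here is a small sketch on the question's data. The round trip back to floats at the end is my own illustration, not part of the answer, and the names formatted and restored are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Johnny', 'Brad'], 'rating': [1.0, 0.9]})

# '{:.2%}' multiplies by 100, keeps two decimals, and appends '%'.
formatted = df['rating'].apply('{:.2%}'.format)
print(formatted.tolist())  # ['100.00%', '90.00%']

# The values are now strings; convert back before doing arithmetic.
restored = formatted.str.rstrip('%').astype(float) / 100
print(restored.tolist())
```

Keeping the numeric column around and formatting a copy avoids the need for that conversion.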
import sys
import timeit
import pandas as pd
print(f"Pandas: {pd.__version__} Python: {sys.version[:5]}\n")
for cur_size in [1, 10, 100, 1000, 10000, 100000, 1000000]:
    mysetup = (f"import pandas as pd; df = pd.DataFrame({{"
               f"'name': ['Johnny', 'Brad']*{cur_size}, "
               f"'rating': [1.0, 0.9]*{cur_size}}}); "
               f"ff = '{{:.2f}}%'.format")
    cs95 = "df.rating.mul(100).astype(str).add('%')"
    michael = "df['rating'].apply(ff)"
    speeds = []
    for stmt in [cs95, michael]:
        speeds.append(timeit.timeit(setup=mysetup, stmt=stmt, number=100))
    print(f"Length: {cur_size*2}. {speeds[0]:.2f}s vs {speeds[1]:.2f}s")
Results:
Pandas: 1.4.3 Python: 3.9.7
Length: 2. 0.02s vs 0.01s
Length: 20. 0.02s vs 0.02s
Length: 200. 0.03s vs 0.03s
Length: 2000. 0.09s vs 0.08s
Length: 20000. 0.79s vs 0.65s
Length: 200000. 8.44s vs 6.94s
Length: 2000000. 90.44s vs 73.57s
Conclusion: the apply method is more idiomatic to pandas and Python, and in this benchmark it took roughly 18% less time on the larger DataFrames (20,000 rows and up).
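One detail worth checking if you re-run the benchmark: the ff formatter in the setup string uses '{:.2f}%', which does not scale the value by 100, whereas the one-liner earlier in this answer uses '{:.2%}'. The sketch below only shows how the outputs differ; it makes no claim about how the choice affects the timings. The names fixed_point and percent are hypothetical:

```python
# Plain-Python comparison of the two format specs (no pandas needed).
fixed_point = '{:.2f}%'.format   # appends '%' without scaling
percent = '{:.2%}'.format        # scales by 100, then appends '%'

print(fixed_point(0.9))  # 0.90%
print(percent(0.9))      # 90.00%
```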
Upvotes: 1
Reputation: 61
Try this:
df['rating'] = pd.Series(["{0:.2f}%".format(val*100) for val in df['rating']], index = df.index)
print(df)
The output is:
     name   rating
0  Johnny  100.00%
1    Brad   90.00%
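The index = df.index argument matters when the DataFrame does not use the default 0..n-1 index, because pandas aligns on labels when assigning a Series. A small sketch, assuming a hypothetical frame indexed by 10 and 20:

```python
import pandas as pd

# Hypothetical frame with a non-default index to show why index= matters.
df = pd.DataFrame({'name': ['Johnny', 'Brad'], 'rating': [1.0, 0.9]},
                  index=[10, 20])

values = ["{0:.2f}%".format(val * 100) for val in df['rating']]

# Without index=, the new Series is labelled 0, 1 and fails to align with 10, 20.
misaligned = df.assign(rating=pd.Series(values))['rating']
print(misaligned.isna().all())  # True

# Passing index=df.index lines the labels up again.
aligned = df.assign(rating=pd.Series(values, index=df.index))['rating']
print(aligned.tolist())  # ['100.00%', '90.00%']
```

Assigning the plain list (df['rating'] = values) would also sidestep alignment, since lists are assigned by position.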
Upvotes: 0
Reputation: 153460
df['rating'] = df['rating'].mul(100).astype(int).astype(str).add('%')
print(df)
Output:
     name  rating
0  Johnny    100%
1    Brad     90%
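One caveat with astype(int): it truncates rather than rounds, so floating-point error in the multiplication can drop a percentage point. A sketch with hypothetical ratings chosen to trigger this, plus a variant that rounds first (my addition, not part of the answer above):

```python
import pandas as pd

# Hypothetical ratings chosen to expose floating-point error: 0.29 * 100
# is stored as 28.999999999999996.
ratings = pd.Series([0.29, 0.9])

# Truncation turns 28.999... into 28.
truncated = ratings.mul(100).astype(int).astype(str).add('%')
print(truncated.tolist())  # ['28%', '90%']

# Rounding to the nearest whole number first avoids the off-by-one.
rounded = ratings.mul(100).round().astype(int).astype(str).add('%')
print(rounded.tolist())  # ['29%', '90%']
```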
Upvotes: 3
Reputation: 402353
Use pandas' broadcasting operations:
df.rating = (df.rating * 100).astype(str) + '%'
df
     name  rating
0  Johnny  100.0%
1    Brad   90.0%
Alternatively, using Series.mul and Series.add:
df.rating = df.rating.mul(100).astype(str).add('%')
df
     name  rating
0  Johnny  100.0%
1    Brad   90.0%
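Note that astype(str) writes out the full float representation, so values that are not exact after multiplying by 100 can produce long strings. A sketch with hypothetical ratings, assuming one decimal place is the precision you want:

```python
import pandas as pd

ratings = pd.Series([0.07, 0.9])  # hypothetical values

# 0.07 * 100 is not exactly 7.0 in floating point, and astype(str) shows that.
raw = ratings.mul(100).astype(str).add('%')
print(raw.tolist())

# Rounding before the string conversion fixes the number of digits.
rounded = ratings.mul(100).round(1).astype(str).add('%')
print(rounded.tolist())  # ['7.0%', '90.0%']
```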
Upvotes: 12