Reputation: 1182
I very often run into situations where I want to use columns in a pandas dataframe to map from one set of values to another, and I'm never sure if I'm using pandas the way I'm supposed to. Given these inputs:
import pandas as pd
df = pd.DataFrame({'A':[17, 19, 23], 'B':['x', 'y', 'z'], 'C':[90, 92, 95]})
to_map = pd.Series(['y', 'x', 'z', 'x', 'alpha'], index=[91, 90, 92, 90, 93])
and assuming df is so large that operations like df.set_index could raise a memory error if they copy the whole dataframe, what is the recommended way to perform the following four mappings in pandas? Additionally, if the recommended approach is not the most memory-efficient, then what is the most efficient way? If a dict comprehension or other Python built-in is more efficient, that's excellent, but I don't want solutions that require additional imports.
1. to_map values from df column B to df column A

desired_output = to_map.map(pd.Series(df['A'].array, index=df['B']),
                            na_action='ignore')
print(desired_output)
91 19.0
90 17.0
92 23.0
90 17.0
93 NaN
dtype: float64
2. to_map values from df column B to df.index

desired_output = to_map.map(pd.Series(df.index, index=df['B']),
                            na_action='ignore')
print(desired_output)
91 1.0
90 0.0
92 2.0
90 0.0
93 NaN
dtype: float64
3. to_map.index from df column C to df column A

desired_output = pd.Series(to_map.index)
desired_output = desired_output.where(desired_output.isin(df['C']))
desired_output = desired_output.map(pd.Series(df['A'].array, index=df['C']),
                                    na_action='ignore')
print(desired_output)
0 NaN
1 17.0
2 19.0
3 17.0
4 NaN
dtype: float64
4. to_map.index from df column C to df.index

desired_output = pd.Series(to_map.index)
desired_output = desired_output.where(desired_output.isin(df['C']))
desired_output = desired_output.map(pd.Series(df.index, index=df['C']),
                                    na_action='ignore')
print(desired_output)
0 NaN
1 0.0
2 1.0
3 0.0
4 NaN
dtype: float64
Upvotes: 1
Views: 434
Reputation: 1182
OK, I got more interested in this, so I put together a script to get the peak memory consumption of the approaches in the original post (called series.map below), @mitoRibo's answer (called reindex below), and @sammywemmy's comments (called merge below). It just makes the data in the OP 1e5 times longer and runs the operations a few times to get average values.
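For reference, here is case 1 (to_map values from df.B to df.A) written with each of the three approaches, using the question's df and to_map. These are the same expressions that appear in the case lists of the script at the end (the script uses a two-column to_map instead of a plain Series). Note that the three results don't all carry the same index labels, so compare values rather than indexes.

# case 1 with each of the three approaches compared below
out_map = to_map.map(pd.Series(df['A'].array, index=df['B']), na_action='ignore')  # series.map
out_reindex = df.set_index('B')['A'].reindex(to_map)                               # reindex
out_merge = df.merge(to_map.rename('B'), how='right')['A']                         # merge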
If you can get everything into an integer type, then series.map always consumes the least memory, sometimes by a lot. Otherwise, if you can't work with integers, then for the cases I set up:

- reindex has the lowest peak memory consumption if you have a short dataframe and a long series of values to map.
- series.map has the lowest peak memory consumption if you're working with a long dataframe, except when taking the index of a series of values from one dataframe column to another dataframe column; then reindex is cheaper.
- merge never has the lowest peak memory consumption.
- merge is your only choice if you have repeated values in the domain of the map (that would be df.B in cases 1 and 2, df.C in cases 3 and 4, or df.A in cases 5 and 6). Note this is the only choice out of the three discussed here; I haven't thought about other options. (A short sketch of this repeated-values case appears after the results tables below.)

Here are the results (code at the end):
peak memory consumption in mb with a long series of values to map
case reindex series.map merge description
0 1 29.018134 29.013939 66.028864 to_map['str'], df.B -> df.A
1 2 29.027581 29.015175 70.035325 to_map['str'], df.B -> df.index
2 3 8.927531 29.516091 49.035328 to_map.index, df.C -> df.A
3 4 8.928258 29.516747 53.039901 to_map.index, df.C -> df.index
4 5 8.428121 8.018872 49.035975 to_map['int'], df.A -> df.B
5 6 8.928532 8.518677 53.039986 to_map['int'], df.A -> df.index
peak memory consumption in mb with a long dataframe for mapping
case reindex series.map merge description
0 1 24.614136 17.412829 19.867535 to_map['str'], df.B -> df.A
1 2 36.859664 17.413827 29.472225 to_map['str'], df.B -> df.index
2 3 13.510243 19.324671 19.870097 to_map.index, df.C -> df.A
3 4 33.859205 21.725148 29.473337 to_map.index, df.C -> df.index
4 5 15.910685 8.470053 19.870748 to_map['int'], df.A -> df.B
5 6 33.859534 10.869827 29.473924 to_map['int'], df.A -> df.index
peak memory consumption in mb with a long series and a long dataframe
case reindex series.map merge description
0 1 36.213309 29.013665 66.023693 to_map['str'], df.B -> df.A
1 2 38.615469 31.414951 79.629001 to_map['str'], df.B -> df.index
2 3 21.769360 29.513805 60.907156 to_map.index, df.C -> df.A
3 4 33.618402 29.725443 70.510802 to_map.index, df.C -> df.index
4 5 23.669874 16.470405 52.024282 to_map['int'], df.A -> df.B
5 6 33.618597 19.370167 61.627128 to_map['int'], df.A -> df.index
size of operands in mb
short_df 0.000318
short_to_map['str'] 0.000254
short_to_map['int'] 0.00014
long_df 34.499973
long_to_map['str'] 25.4
long_to_map['int'] 14.0
Making the operands 10 times bigger (setting num_copies to 1e6 in the code below) generally makes all the values above 10x larger to within about 5%; where there is variance, merge tends to use about 5% more than 10x the memory listed above and the other two tend to use about 5% less. The exception is series.map with integer values (cases 5 and 6), which uses about 20% less than 10x the value above for a short series and a long dataframe.
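To illustrate the repeated-values bullet above, here is a small toy sketch (mine, not part of the benchmark; dup_df and vals are made-up names). With a duplicated value in the lookup column, both series.map and reindex refuse to work because the lookup index has to be unique, while merge simply returns one row per match, so the result can be longer than the series being mapped. The exact exception types and messages depend on your pandas version.

# toy frame with a repeated value in the mapping domain (column B)
dup_df = pd.DataFrame({'A': [17, 19, 23], 'B': ['x', 'y', 'x'], 'C': [90, 92, 95]})
vals = pd.Series(['y', 'x', 'z'], index=[91, 90, 92])

# series.map and reindex both need a unique lookup index, so these raise
try:
    vals.map(pd.Series(dup_df['A'].array, index=dup_df['B']), na_action='ignore')
except Exception as e:
    print(type(e).__name__, ':', e)
try:
    dup_df.set_index('B')['A'].reindex(vals)
except Exception as e:
    print(type(e).__name__, ':', e)

# merge still works; 'x' matches two rows of dup_df, so the output has four rows
print(dup_df.merge(vals.rename('B'), how='right')['A'])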
I used the following script in a Jupyter Notebook on a Windows 11 machine with an Intel Core i7 processor and 16 GB of memory. The code requires Python 3.4 or above; I referred to this SO post for memory profiling, and I copied randomword from this post.
import random
import string
import sys
import tracemalloc
import pandas as pd
def grab_traced_memory_and_reset(display_text=None):
    # report current and peak traced memory in mb, then reset the peak counter
    current, peak = map(lambda x: x / 1e6, tracemalloc.get_traced_memory())
    if display_text is not None:
        print(display_text + '\n')
        print('>>> current mem usage (mb):', current)
        print('>>> peak since reset (mb): ', peak)
        print('reset peak\n')
    tracemalloc.reset_peak()
    return current, peak

def run_cases(cases, name, print_values=False):
    # evaluate each case expression and record its peak memory above the baseline
    profile = pd.DataFrame({'case':range(len(cases)), name:0})
    baseline = grab_traced_memory_and_reset()[0]
    for i in range(len(cases)):
        if print_values:
            text = cases[i]
        else:
            text = None
        desired_output = eval(cases[i])
        current, peak = grab_traced_memory_and_reset(text)
        profile.loc[i, name] = peak - baseline
        del(desired_output)
    return profile

def average_cases(cases, name, num_runs):
    # average the per-case peaks over num_runs repetitions
    result = [run_cases(cases, name) for i in range(num_runs)]
    return pd.concat(result).groupby(level=0).mean()

descriptions = ["to_map['str'], df.B -> df.A    ",
                "to_map['str'], df.B -> df.index",
                "to_map.index, df.C -> df.A     ",
                "to_map.index, df.C -> df.index ",
                "to_map['int'], df.A -> df.B    ",
                "to_map['int'], df.A -> df.index"]

def report_results(reindex_r, merge_r, map_r):
    results = reindex_r.merge(map_r).merge(merge_r)
    results.loc[:, 'case'] = (results['case'] + 1).astype(int)
    results['description'] = descriptions
    print(results)

def to_map_index_df_col_to_col(to_map):
    # case 3: map to_map.index from df.C to df.A
    output = pd.Series(to_map.index)
    output = output.where(output.isin(df['C']))
    return output.map(pd.Series(df['A'].array, index=df['C']),
                      na_action='ignore')

def to_map_index_df_col_to_index(to_map):
    # case 4: map to_map.index from df.C to df.index
    output = pd.Series(to_map.index)
    output = output.where(output.isin(df['C']))
    return output.map(pd.Series(df.index, index=df['C']),
                      na_action='ignore')

def randomword(length):
    letters = string.ascii_lowercase + string.ascii_uppercase
    return ''.join(random.choice(letters) for i in range(length))
# number of copies to make data structures bigger
num_copies = int(1e5)
short_df = pd.DataFrame({'A':[17, 19, 23], 'B':['x', 'y', 'z'], 'C':[90, 92, 95]})
long_df = pd.DataFrame({'A':[17, 19, 23] + list(range(24, num_copies*3+21)),
'B':['x', 'y', 'z'] + [randomword(10) for i in range((num_copies-1)*3)],
'C':[90, 92, 95] + list(range(3*num_copies, 6*num_copies-3))})
short_to_map = pd.DataFrame({'str':['y', 'x', 'z', 'x', 'alpha'],
'int':[19, 17, 23, 17, 43]},
index=[91, 90, 92, 90, 93])
long_to_map = pd.concat([short_to_map]*num_copies).reset_index(drop=True)
map_cases = ["to_map['str'].map(pd.Series(df['A'].array, index=df['B']), na_action='ignore')",
"to_map['str'].map(pd.Series(df.index, index=df['B']), na_action='ignore')",
"to_map_index_df_col_to_col(to_map)",
"to_map_index_df_col_to_index(to_map)",
"to_map['int'].map(pd.Series(df['B'].array, index=df['A']), na_action='ignore')",
"to_map['int'].map(pd.Series(df.index, index=df['A']), na_action='ignore')"]
reindex_cases = ["df.set_index('B')['A'].reindex(to_map['str'])",
"df.reset_index().set_index('B')['index'].reindex(to_map['str'])",
"df.set_index('C')['A'].reindex(to_map.index)",
"df.reset_index().set_index('C')['index'].reindex(to_map.index)",
"df.set_index('A')['B'].reindex(to_map['int'])",
"df.reset_index().set_index('A')['index'].reindex(to_map['int'])"]
merge_cases = ["df.merge(to_map['str'].rename('B'), how='right')['A']",
"df.reset_index().merge(to_map['str'].rename('B'), how='right')['index']",
"df.merge(pd.Series(to_map.index).rename('C'), how='right')['A']",
"df.reset_index().merge(pd.Series(to_map.index).rename('C'), how='right')['index']",
"df.merge(to_map['int'].rename('A'), how='right')['A']",
"df.reset_index().merge(to_map['int'].rename('A'), how='right')['index']"]
tracemalloc.start()
# uncomment below to see the results for individual runs
# in a single set of cases
#run_cases(reindex_cases, 'reindex', print_values=True)
#run_cases(merge_cases, 'merge', print_values=True)
#run_cases(map_cases, 'map', print_values=True)
print('peak memory consumption in mb with a long series of values to map')
df = short_df
to_map = long_to_map
reindex_results = average_cases(reindex_cases, 'reindex', 10)
merge_results = average_cases(merge_cases, 'merge', 10)
map_results = average_cases(map_cases, 'series.map', 10)
report_results(reindex_results, merge_results, map_results)
print()
print('peak memory consumption in mb with a long dataframe for mapping')
df = long_df
to_map = short_to_map
reindex_results = average_cases(reindex_cases, 'reindex', 10)
merge_results = average_cases(merge_cases, 'merge', 10)
map_results = average_cases(map_cases, 'series.map', 10)
report_results(reindex_results, merge_results, map_results)
print()
print('peak memory consumption in mb with a long series and a long dataframe')
df = long_df
to_map = long_to_map
reindex_results = average_cases(reindex_cases, 'reindex', 10)
merge_results = average_cases(merge_cases, 'merge', 10)
map_results = average_cases(map_cases, 'series.map', 10)
report_results(reindex_results, merge_results, map_results)
print()
print('size of operands in mb')
print(' short_df ', short_df.applymap(sys.getsizeof).sum().sum() / 1e6)
print("short_to_map['str'] ", short_to_map['str'].apply(sys.getsizeof).sum() / 1e6)
print("short_to_map['int'] ", short_to_map['int'].apply(sys.getsizeof).sum() / 1e6)
print(' long_df ', long_df.applymap(sys.getsizeof).sum().sum() / 1e6)
print(" long_to_map['str'] ", long_to_map['str'].apply(sys.getsizeof).sum() / 1e6)
print(" long_to_map['int'] ", long_to_map['int'].apply(sys.getsizeof).sum() / 1e6)
Upvotes: 2
Reputation: 4548
I would adjust the index of the df and then reindex using the to_map. I'm not sure that these are all the fastest approaches, but they're all vectorized and use core pandas functions. I also think they are pretty readable, but you can break them into multiple lines. Curious to know if these are slow/fast for your use cases.
#1
print(df.set_index('B')['A'].reindex(to_map))
#2 (maybe slow? not sure)
print(df.reset_index().set_index('B')['index'].reindex(to_map))
#3
print(df.set_index('C')['A'].reindex(to_map.index))
#4
print(df.reset_index().set_index('C')['index'].reindex(to_map.index))
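One note to add (mine, not part of the answer): reindex takes the passed values as the new index labels, so #1 and #2 come back indexed by to_map's values ('y', 'x', ...) rather than by to_map's original index (91, 90, ...) as in the desired outputs in the question. In recent pandas versions, something like set_axis can put the original labels back, e.g.:

# realign the reindexed result with to_map's original index labels
print(df.set_index('B')['A'].reindex(to_map).set_axis(to_map.index))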
Upvotes: 1