shortorian

Reputation: 1182

What is the idiomatic way to make a map between columns in pandas?

I very often run into situations where I want to use columns in a pandas dataframe to map from one set of values to another, and I'm never sure whether I'm using pandas the way I'm supposed to. Given these inputs

import pandas as pd

df = pd.DataFrame({'A':[17, 19, 23], 'B':['x', 'y', 'z'], 'C':[90, 92, 95]})
to_map = pd.Series(['y', 'x', 'z', 'x', 'alpha'], index=[91, 90, 92, 90, 93])

and assuming df is so large that operations like df.set_index could raise a memory error if they copy the whole dataframe, what is the recommended way to perform the following four mappings in pandas? Additionally, if the recommended approach is not the most memory-efficient one, what is the most efficient way? If a dict comprehension or other Python built-in is more efficient, that's excellent, but I don't want solutions that require additional imports.
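
For instance, something like this dict-based version of the first mapping (case 1 below) would qualify, though I haven't profiled its memory use:

mapping = dict(zip(df['B'], df['A']))  # plain dict built from two columns
print(to_map.map(mapping))             # keys missing from the dict ('alpha') come back as NaN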

1. to_map values from df column to df column

desired_output = to_map.map(pd.Series(df['A'].array, index=df['B']),
                            na_action='ignore')
print(desired_output)
91    19.0
90    17.0
92    23.0
90    17.0
93     NaN
dtype: float64

2. to_map values from df column to df.index

desired_output = to_map.map(pd.Series(df.index, index=df['B']),
                            na_action='ignore')
print(desired_output)
91    1.0
90    0.0
92    2.0
90    0.0
93    NaN
dtype: float64

3. to_map.index from df column to df column

desired_output = pd.Series(to_map.index)
desired_output = desired_output.where(desired_output.isin(df['C']))
desired_output = desired_output.map(pd.Series(df['A'].array, index=df['C']),
                                    na_action='ignore')
print(desired_output)
0     NaN
1    17.0
2    19.0
3    17.0
4     NaN
dtype: float64

4. to_map.index from df column to df.index

desired_output = pd.Series(to_map.index)
desired_output = desired_output.where(desired_output.isin(df['C']))
desired_output = desired_output.map(pd.Series(df.index, index=df['C']),
                                    na_action='ignore')
print(desired_output)
0    NaN
1    0.0
2    1.0
3    0.0
4    NaN
dtype: float64
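
(One more note: I suspect the where/isin step in cases 3 and 4 is actually redundant, since map with a Series mapper already returns NaN for keys missing from its index. If that's right, case 3 reduces to the sketch below, though I haven't checked whether it changes the memory profile.)

# possibly equivalent, shorter version of case 3; unmatched keys map to NaN
desired_output = pd.Series(to_map.index).map(pd.Series(df['A'].array, index=df['C']))
print(desired_output)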

Upvotes: 1

Views: 434

Answers (2)

shortorian

Reputation: 1182

OK, I got more interested in this, so I put together a script to measure peak memory consumption for the approaches in the original post (called series.map below), @mitoRibo's answer (called reindex below), and @sammywemmy's comments (called merge below). It just makes the data in the OP 1e5 times longer and runs the operations a few times to get average values.

TL;DR

If you can get everything into an integer type, then series.map always consumes the least memory, sometimes by a lot.

Otherwise, if you can't work with integers, for the cases I set up

  • reindex has the lowest peak memory consumption if you have a short dataframe and a long series of values to map.
  • series.map has the lowest peak memory consumption if you're working with a long dataframe, except when taking the index of a series of values from one dataframe column to another dataframe column; then reindex is cheaper.
  • merge never has the lowest peak memory consumption.
  • merge is your only choice if you have repeated values in the domain of the map (that would be df.B in cases 1 and 2, df.C in cases 3 and 4, or df.A in cases 5 and 6). Note that's only among the three approaches discussed here; I haven't thought about other options. See the sketch after this list.
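
To see why repeated domain values rule out the other two, here's a minimal sketch with toy data of my own (not the benchmark data):

import pandas as pd

# toy data with a repeated value in the mapping domain (df.B)
dup_df = pd.DataFrame({'A': [17, 19, 23], 'B': ['x', 'y', 'x'], 'C': [90, 92, 95]})
dup_to_map = pd.Series(['y', 'x'], index=[91, 90])

try:
    # the mapper series now has a duplicate index label 'x', and map
    # refuses to look values up in a non-unique index
    dup_to_map.map(pd.Series(dup_df['A'].array, index=dup_df['B']))
except Exception as err:
    print(type(err).__name__)  # InvalidIndexError in the versions I tried

# reindex fails the same way, complaining about a duplicate axis, but merge
# works and returns one output row per match ('x' matches two df rows)
print(dup_df.merge(dup_to_map.rename('B'), how='right')['A'])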

Details

Here are the results (code at the end)

peak memory consumption in mb with a long series of values to map
   case    reindex  series.map      merge                      description
0     1  29.018134   29.013939  66.028864  to_map['str'], df.B -> df.A    
1     2  29.027581   29.015175  70.035325  to_map['str'], df.B -> df.index
2     3   8.927531   29.516091  49.035328  to_map.index,  df.C -> df.A    
3     4   8.928258   29.516747  53.039901  to_map.index,  df.C -> df.index
4     5   8.428121    8.018872  49.035975  to_map['int'], df.A -> df.B    
5     6   8.928532    8.518677  53.039986  to_map['int'], df.A -> df.index

peak memory consumption in mb with a long dataframe for mapping
   case    reindex  series.map      merge                      description
0     1  24.614136   17.412829  19.867535  to_map['str'], df.B -> df.A    
1     2  36.859664   17.413827  29.472225  to_map['str'], df.B -> df.index
2     3  13.510243   19.324671  19.870097  to_map.index,  df.C -> df.A    
3     4  33.859205   21.725148  29.473337  to_map.index,  df.C -> df.index
4     5  15.910685    8.470053  19.870748  to_map['int'], df.A -> df.B    
5     6  33.859534   10.869827  29.473924  to_map['int'], df.A -> df.index

peak memory consumption in mb with a long series and a long dataframe
   case    reindex  series.map      merge                      description
0     1  36.213309   29.013665  66.023693  to_map['str'], df.B -> df.A    
1     2  38.615469   31.414951  79.629001  to_map['str'], df.B -> df.index
2     3  21.769360   29.513805  60.907156  to_map.index,  df.C -> df.A    
3     4  33.618402   29.725443  70.510802  to_map.index,  df.C -> df.index
4     5  23.669874   16.470405  52.024282  to_map['int'], df.A -> df.B    
5     6  33.618597   19.370167  61.627128  to_map['int'], df.A -> df.index

size of operands in mb
           short_df   0.000318
short_to_map['str']   0.000254
short_to_map['int']   0.00014
            long_df  34.499973
 long_to_map['str']  25.4
 long_to_map['int']  14.0

Making the operands 10 times bigger (setting num_copies to 1e6 in the code below) generally scales all of the values above by 10x to within about 5%, although where there is variance, merge tends to use about 5% more than 10x the memory listed above and the other two tend to use about 5% less. The exception is series.map with integer values (cases 5 and 6), which uses 20% less than 10x the above value for a short series and a long dataframe.

I used the following script in a Jupyter notebook on a Windows 11 machine with an Intel Core i7 processor and 16 GB of memory. The code requires Python 3.4 or above; I referred to this SO post for memory profiling and copied randomword from this post.

import random
import string
import sys
import tracemalloc
import pandas as pd

def grab_traced_memory_and_reset(display_text=None):
    # tracemalloc reports bytes; convert to megabytes
    current, peak = map(lambda x: x / 1e6, tracemalloc.get_traced_memory())
    if display_text is not None:
        print(display_text + '\n')
        print('>>> current mem usage (mb):', current)
        print('>>> peak since reset (mb): ', peak)
        print('reset peak\n')
    tracemalloc.reset_peak()
    return current, peak

def run_cases(cases, name, print_values=False):
    
    profile = pd.DataFrame({'case':range(len(cases)), name:0})
    
    baseline = grab_traced_memory_and_reset()[0]

    for i in range(len(cases)):

        if print_values:
            text = cases[i]
        else: 
            text = None
        
        # evaluate the case expression against the global df and to_map
        desired_output = eval(cases[i])
        current, peak = grab_traced_memory_and_reset(text)
        profile.loc[i, name] = peak - baseline
        del desired_output

    return profile

def average_cases(cases, name, num_runs):
    result = [run_cases(cases, name) for i in range(num_runs)]
    return pd.concat(result).groupby(level=0).mean()

descriptions = ["to_map['str'], df.B -> df.A    ",
                "to_map['str'], df.B -> df.index",
                "to_map.index,  df.C -> df.A    ",
                "to_map.index,  df.C -> df.index",
                "to_map['int'], df.A -> df.B    ",
                "to_map['int'], df.A -> df.index"]

def report_results(reindex_r, merge_r, map_r):
    results = reindex_r.merge(map_r).merge(merge_r)
    results.loc[:, 'case'] = (results['case'] + 1).astype(int)
    results['description'] = descriptions
    print(results) 

def to_map_index_df_col_to_col(to_map):
    output = pd.Series(to_map.index)
    output = output.where(output.isin(df['C']))
    return output.map(pd.Series(df['A'].array, index=df['C']),
                      na_action='ignore')

def to_map_index_df_col_to_index(to_map):
    output = pd.Series(to_map.index)
    output = output.where(output.isin(df['C']))
    return output.map(pd.Series(df.index, index=df['C']),
                      na_action='ignore')

def randomword(length):
    letters = string.ascii_lowercase + string.ascii_uppercase
    return ''.join(random.choice(letters) for i in range(length))

# number of copies to make data structures bigger
num_copies = int(1e5)

short_df = pd.DataFrame({'A':[17, 19, 23], 'B':['x', 'y', 'z'], 'C':[90, 92, 95]})
long_df = pd.DataFrame({'A':[17, 19, 23] + list(range(24, num_copies*3+21)),
                        'B':['x', 'y', 'z'] + [randomword(10) for i in range((num_copies-1)*3)],
                        'C':[90, 92, 95] + list(range(3*num_copies, 6*num_copies-3))})

short_to_map = pd.DataFrame({'str':['y', 'x', 'z', 'x', 'alpha'],
                             'int':[19, 17, 23, 17, 43]},
                            index=[91, 90, 92, 90, 93])
long_to_map = pd.concat([short_to_map]*num_copies).reset_index(drop=True)


map_cases = ["to_map['str'].map(pd.Series(df['A'].array, index=df['B']), na_action='ignore')",
             "to_map['str'].map(pd.Series(df.index, index=df['B']), na_action='ignore')",
             "to_map_index_df_col_to_col(to_map)",
             "to_map_index_df_col_to_index(to_map)",
             "to_map['int'].map(pd.Series(df['B'].array, index=df['A']), na_action='ignore')",
             "to_map['int'].map(pd.Series(df.index, index=df['A']), na_action='ignore')"]

reindex_cases = ["df.set_index('B')['A'].reindex(to_map['str'])",
                 "df.reset_index().set_index('B')['index'].reindex(to_map['str'])",
                 "df.set_index('C')['A'].reindex(to_map.index)",
                 "df.reset_index().set_index('C')['index'].reindex(to_map.index)",
                 "df.set_index('A')['B'].reindex(to_map['int'])",
                 "df.reset_index().set_index('A')['index'].reindex(to_map['int'])"]

merge_cases = ["df.merge(to_map['str'].rename('B'), how='right')['A']",
               "df.reset_index().merge(to_map['str'].rename('B'), how='right')['index']",
               "df.merge(pd.Series(to_map.index).rename('C'), how='right')['A']",
               "df.reset_index().merge(pd.Series(to_map.index).rename('C'), how='right')['index']",
               "df.merge(to_map['int'].rename('A'), how='right')['A']",
               "df.reset_index().merge(to_map['int'].rename('A'), how='right')['index']"]

tracemalloc.start()

# uncomment below to see the results for individual runs 
# in a single set of cases
#run_cases(reindex_cases, 'reindex', print_values=True)
#run_cases(merge_cases, 'merge', print_values=True)
#run_cases(map_cases, 'map', print_values=True)

print('peak memory consumption in mb with a long series of values to map')
df = short_df
to_map = long_to_map

reindex_results = average_cases(reindex_cases, 'reindex', 10)
merge_results = average_cases(merge_cases, 'merge', 10)
map_results = average_cases(map_cases, 'series.map', 10)
report_results(reindex_results, merge_results, map_results) 
print()

print('peak memory consumption in mb with a long dataframe for mapping')
df = long_df
to_map = short_to_map

reindex_results = average_cases(reindex_cases, 'reindex', 10)
merge_results = average_cases(merge_cases, 'merge', 10)
map_results = average_cases(map_cases, 'series.map', 10)
report_results(reindex_results, merge_results, map_results)
print()

print('peak memory consumption in mb with a long series and a long dataframe')
df = long_df
to_map = long_to_map

reindex_results = average_cases(reindex_cases, 'reindex', 10)
merge_results = average_cases(merge_cases, 'merge', 10)
map_results = average_cases(map_cases, 'series.map', 10)
report_results(reindex_results, merge_results, map_results)
print()

print('size of operands in mb')
print('           short_df  ', short_df.applymap(sys.getsizeof).sum().sum() / 1e6)
print("short_to_map['str']  ", short_to_map['str'].apply(sys.getsizeof).sum() / 1e6)
print("short_to_map['int']  ", short_to_map['int'].apply(sys.getsizeof).sum() / 1e6)
print('            long_df ', long_df.applymap(sys.getsizeof).sum().sum() / 1e6)
print(" long_to_map['str'] ", long_to_map['str'].apply(sys.getsizeof).sum() / 1e6)
print(" long_to_map['int'] ", long_to_map['int'].apply(sys.getsizeof).sum() / 1e6)

Upvotes: 2

mitoRibo

Reputation: 4548

I would adjust the index of the df and then reindex using to_map. I'm not sure these are all the fastest approaches, but they're all vectorized and use core pandas functions. I also think they're pretty readable, but you can break them into multiple lines. Curious to know whether these are fast or slow for your use cases.

#1
print(df.set_index('B')['A'].reindex(to_map))

#2 (maybe slow? not sure)
print(df.reset_index().set_index('B')['index'].reindex(to_map))

#3 
print(df.set_index('C')['A'].reindex(to_map.index))

#4
print(df.reset_index().set_index('C')['index'].reindex(to_map.index))
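
One thing to note: reindex returns a result indexed by to_map's values rather than by to_map's original index. If you want output keyed like the desired outputs in the question, I believe you can swap the axis back afterwards (a quick sketch):

result = df.set_index('B')['A'].reindex(to_map)
result.index = to_map.index  # restore the 91, 90, 92, 90, 93 labels
print(result)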

Upvotes: 1
