Efficiently obtaining the union of pandas indices

Question

I have two pandas dataframes df1 and df2 and I want their "merged index".

By that I mean the index obtained from, for instance, df1.add(df2, fill_value=0).index (basically, the union of the row names). Computations such as add are performed in separate scripts. I don't want to compute the "merged index" in those scripts, but I would also like to avoid repeating the computations when I'm only interested in the "merged index".

Is there a more "direct" (and hopefully efficient) way of doing this?

My goal is actually to associate "tags" with index elements. I have several pairs of dataframes. Each pair corresponds to one "tag", and the two dataframes in a pair may have overlapping indices. Different pairs correspond to different tags, and they are supposed to have no overlapping indices.

Basically, I'm looking for an efficient implementation of an associate_tag function that would work as follows:

dfA_1:

idA_1    2    0
idA_2    1    0
idA_3    0    2

dfA_2:

idA_1    3    2    1
idA_3    2    6    2
idA_4    4    0    2

merge_A = associate_tag((dfA_1, dfA_2), "A"):

idA_1    A
idA_2    A
idA_3    A
idA_4    A

dfB_1:

idB_1    2    2    1
idB_2    3    0    0
idB_3    3    1    3

dfB_2:

idB_1    0
idB_2    3
idB_4    2

merge_B = associate_tag((dfB_1, dfB_2), "B"):

idB_1    B
idB_2    B
idB_3    B
idB_4    B

total_merge = pd.concat((merge_A, merge_B)):

idA_1    A
idA_2    A
idA_3    A
idA_4    A
idB_1    B
idB_2    B
idB_3    B
idB_4    B

I know which tag to associate with the index elements of a given pair of dataframes, and the associate_tag function would ideally ignore the numbers in the dataframes entirely.

Here is a non-ideal implementation:

from functools import reduce
from itertools import repeat

import pandas as pd

def add_dataframes(df1, df2):
    # Element-wise sum; values missing from one frame are treated as 0.
    return df1.add(df2, fill_value=0)

def sum_dataframes(dfs):
    return reduce(add_dataframes, dfs)

def associate_tag(dfs, tag):
    # Sums all the data only to read the resulting index (hence "non-ideal").
    s = sum_dataframes(dfs)
    return pd.DataFrame(list(zip(s.index, repeat(tag)))).set_index(0)

I plan to use this total_merge to easily add a "tag" column to dataframes containing a mixture of index elements. For instance, I could have:

df:

idA_2    5    4    1
idB_1    1    0    0
idB_4    2    1    2
idA_4    2    3    2

And I would then use pd.concat((df, total_merge), join="inner", axis=1) to add an extra column with the tags:

idA_2    5    4    1    A
idB_1    1    0    0    B
idB_4    2    1    2    B
idA_4    2    3    2    A
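The tagging step above can be sketched as follows. This is only an illustration under assumptions: total_merge is stood in for by a hand-built Series named "tag", and the column labels c1, c2, c3 are invented because the original tables show none. DataFrame.join is included as one possible alternative to pd.concat; its default left join keeps the row order of df.

```python
import pandas as pd

# Hypothetical stand-in for total_merge: one tag per index label.
total_merge = pd.Series(
    ["A", "A", "A", "A", "B", "B", "B", "B"],
    index=["idA_1", "idA_2", "idA_3", "idA_4",
           "idB_1", "idB_2", "idB_3", "idB_4"],
    name="tag",
)

# Dataframe with a mixture of index elements (column labels are invented).
df = pd.DataFrame(
    [[5, 4, 1], [1, 0, 0], [2, 1, 2], [2, 3, 2]],
    index=["idA_2", "idB_1", "idB_4", "idA_4"],
    columns=["c1", "c2", "c3"],
)

# Approach from the question: keep only labels present in both objects.
tagged = pd.concat((df, total_merge), join="inner", axis=1)

# Possible alternative: a left join, which keeps every row of df in its original order.
tagged_join = df.join(total_merge)
```

Both results carry an extra "tag" column; labels of total_merge that do not occur in df are dropped in either case.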

Is there a better way to do this kind of operation?

bli · Accepted Answer

I finally found out that the pandas Index object has an __or__ implementation (a set union, equivalent to Index.union, in the pandas version I used).

Hopefully the following version of associate_tag avoids superfluous operations:

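from operator import or_ as union
from itertools import repeat
from functools import reduce

import pandas as pd

def associate_tag(dfs, tag):
    # idx1 | idx2 gives the union of two Index objects (pandas < 2.0);
    # the values stored in the dataframes are never touched.
    idx = reduce(union, (df.index for df in dfs))
    return pd.DataFrame(list(zip(idx, repeat(tag)))).set_index(0)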
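
In pandas 2.0 and later, the | operator between Index objects behaves as an element-wise logical operation rather than a set union, so the explicit Index.union method is the safer spelling there. A minimal sketch of the same idea, assuming sample frames modeled on the question (the column labels x and y are invented) and returning a Series instead of a one-column DataFrame:

```python
from functools import reduce

import pandas as pd


def associate_tag(dfs, tag):
    """Map every label found in the index of any frame in `dfs` to `tag`."""
    # Index.union is a set union of the labels; the frame data is never read.
    idx = reduce(pd.Index.union, (df.index for df in dfs))
    # A scalar value is broadcast over the whole index.
    return pd.Series(tag, index=idx, name="tag")


# Sample data modeled on the question; column labels are hypothetical.
dfA_1 = pd.DataFrame({"x": [2, 1, 0], "y": [0, 0, 2]},
                     index=["idA_1", "idA_2", "idA_3"])
dfA_2 = pd.DataFrame({"x": [3, 2, 4]},
                     index=["idA_1", "idA_3", "idA_4"])
dfB_1 = pd.DataFrame({"x": [2, 3, 3]},
                     index=["idB_1", "idB_2", "idB_3"])
dfB_2 = pd.DataFrame({"x": [0, 3, 2]},
                     index=["idB_1", "idB_2", "idB_4"])

merge_A = associate_tag((dfA_1, dfA_2), "A")
merge_B = associate_tag((dfB_1, dfB_2), "B")
total_merge = pd.concat((merge_A, merge_B))
```

Since the result is a named Series, it can be passed directly to pd.concat or DataFrame.join to add a "tag" column to another dataframe.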
