bli
bli

Reputation: 8194

Efficiently obtaining the union of pandas indices

I have two pandas dataframes df1 and df2 and I want their "merged index".

By that I mean the index that is obtained when I do for instance df1.add(df2, fill_value=0).index (basically, a union of the row names). This kind of computation (here, add) is performed in separate scripts, and I don't want to compute the "merged index" in these scripts, but I would also like to avoid doing these computations when I'm just interested in the "merged index".

Is there a more "direct" (and hopefully efficient) way of doing this?

My goal is actually to associate "tags" to index elements. I have several pairs of dataframes. Each pair corresponds to one "tag" and may have overlapping indices. Different pairs correspond do different tags, and they are supposed do have no overlapping indices.

Basically, I'm looking for an efficient implementation of an associate_tag function that would work as follows:

dfA_1:

idA_1    2    0
idA_2    1    0
idA_3    0    2

dfA_2:

idA_1    3    2    1
idA_3    2    6    2
idA_4    4    0    2

merge_A = associate_tag((dfA_1, dfA_2), "A"):

idA_1    A
idA_2    A
idA_3    A
idA_4    A

dfB_1:

idB_1    2    2    1
idB_2    3    0    0
idB_3    3    1    3

dfB_2:

idB_1    0
idB_2    3
idB_4    2

merge_B = associate_tag((dfB_1, dfB_2), "B"):

idB_1    B
idB_2    B
idB_3    B
idB_4    B

total_merge = pd.concat((merge_A, merge_B)):

idA_1    A
idA_2    A
idA_3    A
idA_4    A
idB_1    B
idB_2    B
idB_3    B
idB_4    B

I know what tag to associate to the index elements of a given pair of dataframes, and the associate_tag function would ideally completely ignore the numbers in the dataframes.

Here is a non-ideal implementation:

from functools import reduce
from itertools import repeat

def add_dataframes(df1, df2):
    return df1.add(df2, fill_value=0)

def sum_dataframes(dfs):
    return reduce(add_dataframes, dfs)

def associate_tag(dfs, tag):
    return pd.concat((sum_dataframes(dfs).index, repeat(tag)), axis=1)

def associate_tag(dfs, tag):   
    s = sum_dataframes(dfs)
    return pd.DataFrame(list(zip(s.index, repeat(tag)))).set_index(0)

I plan to use this total_merge to easily add a "tag" column to dataframes containing a mixture of index elements. For instance, I could have:

df:

idA_2    5    4    1
idB_1    1    0    0
idB_4    2    1    2
idA_4    2    3    2

And I would then use pd.concat((df, total_merge), join="inner", axis=1) to add an extra column with the tags:

idA_2    5    4    1    A
idB_1    1    0    0    B
idB_4    2    1    2    B
idA_4    2    3    2    A

Is there a better way to do this kind of operation?

Upvotes: 5

Views: 4630

Answers (2)

bli
bli

Reputation: 8194

I finally found out that pandas Index object had an __or__ implementation.

Hopefully the following version of associate_tag avoids superfluous operations:

from operator import or_ as union
from itertools import repeat
from functools import reduce

def associate_tag(dfs, tag):   
    idx = reduce(union, (df.index for df in dfs))
    return pd.DataFrame(list(zip(idx, repeat(tag)))).set_index(0)

Upvotes: 8

tvashtar
tvashtar

Reputation: 4315

Based on your comment here is an amended solution:

Two parts: Combining your dataframes, depending on your column names, you could just pd.concat your whole list of dataframes once you've made sure the column names line up. So if: dfA_1 is:

       col1  col2
index            
idA_1     2     0
idA_2     1     0
idA_3     0     2

and dfA_2 is:

       col1  col2  col3
index                  
idA_1     3     2     1
idA_3     2     6     2
idA_4     4     0     2

then

final = pd.concat([dfA_1,dfA_2])

final
       col1  col2  col3
index                  
idA_1     2     0   NaN
idA_2     1     0   NaN
idA_3     0     2   NaN
idA_1     3     2   1.0
idA_3     2     6   2.0
idA_4     4     0   2.0

To fill those NaNs with zeros:

final.fillna(0, inplace=True)

Part 2, the tags: Once you have that creating the tags is as easy as defining a map for the index, you can either write a simple function, hardcode a dict, or use a lambda:

final['tag'] = final.index.map(lambda x: x[2])

final
       col1  col2  col3 tag
index                      
idA_1     2     0   0.0   A
idA_2     1     0   0.0   A
idA_3     0     2   0.0   A
idA_1     3     2   1.0   A
idA_3     2     6   2.0   A
idA_4     4     0   2.0   A

Upvotes: 0

Related Questions