Reputation: 8194
I have two pandas dataframes, df1 and df2, and I want their "merged index". By that I mean the index obtained when I do, for instance, df1.add(df2, fill_value=0).index (basically, a union of the row names). This kind of computation (here, add) is performed in separate scripts, and I don't want to compute the "merged index" in these scripts, but I would also like to avoid doing these computations when I'm just interested in the "merged index".
Is there a more "direct" (and hopefully efficient) way of doing this?
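To make "merged index" concrete, here is a small sketch (with hypothetical tiny frames; only the row labels matter, not the values):

import pandas as pd

# Hypothetical example frames; only the row labels matter here.
df1 = pd.DataFrame({"x": [2, 1]}, index=["idA_1", "idA_2"])
df2 = pd.DataFrame({"x": [3, 4]}, index=["idA_1", "idA_4"])

# What I currently do: the arithmetic is only a means to obtain the index.
merged_index = df1.add(df2, fill_value=0).index
# Index(['idA_1', 'idA_2', 'idA_4'], dtype='object')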
My goal is actually to associate "tags" with index elements. I have several pairs of dataframes. Each pair corresponds to one "tag" and may have overlapping indices. Different pairs correspond to different tags, and they are supposed to have no overlapping indices.
Basically, I'm looking for an efficient implementation of an associate_tag function that would work as follows:
dfA_1:
idA_1 2 0
idA_2 1 0
idA_3 0 2
dfA_2:
idA_1 3 2 1
idA_3 2 6 2
idA_4 4 0 2
merge_A = associate_tag((dfA_1, dfA_2), "A") should give:
idA_1 A
idA_2 A
idA_3 A
idA_4 A
dfB_1:
idB_1 2 2 1
idB_2 3 0 0
idB_3 3 1 3
dfB_2:
idB_1 0
idB_2 3
idB_4 2
merge_B = associate_tag((dfB_1, dfB_2), "B") should give:
idB_1 B
idB_2 B
idB_3 B
idB_4 B
total_merge = pd.concat((merge_A, merge_B)) should give:
idA_1 A
idA_2 A
idA_3 A
idA_4 A
idB_1 B
idB_2 B
idB_3 B
idB_4 B
I know what tag to associate with the index elements of a given pair of dataframes, and the associate_tag function would ideally ignore the numbers in the dataframes completely.
Here is a non-ideal implementation:
import pandas as pd
from functools import reduce
from itertools import repeat

def add_dataframes(df1, df2):
    # Align on the union of the indices, filling missing entries with 0.
    return df1.add(df2, fill_value=0)

def sum_dataframes(dfs):
    return reduce(add_dataframes, dfs)

def associate_tag(dfs, tag):
    # The sum is computed only to get at the union of the indices.
    s = sum_dataframes(dfs)
    return pd.DataFrame(list(zip(s.index, repeat(tag)))).set_index(0)
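For reference, a usage sketch of this implementation against the example data above (the frames are typed out by hand here, and I only care about the resulting index):

# The dfA_1 / dfA_2 example frames from above, typed out by hand.
dfA_1 = pd.DataFrame([[2, 0], [1, 0], [0, 2]], index=["idA_1", "idA_2", "idA_3"])
dfA_2 = pd.DataFrame([[3, 2, 1], [2, 6, 2], [4, 0, 2]], index=["idA_1", "idA_3", "idA_4"])

merge_A = associate_tag((dfA_1, dfA_2), "A")
# merge_A is indexed by idA_1 ... idA_4 and has a single column containing "A",
# but the values of dfA_1 and dfA_2 were needlessly added along the way.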
I plan to use this total_merge to easily add a "tag" column to dataframes containing a mixture of index elements. For instance, I could have:
df:
idA_2 5 4 1
idB_1 1 0 0
idB_4 2 1 2
idA_4 2 3 2
And I would then use pd.concat((df, total_merge), join="inner", axis=1) to add an extra column with the tags:
idA_2 5 4 1 A
idB_1 1 0 0 B
idB_4 2 1 2 B
idA_4 2 3 2 A
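Putting the pieces together, the tagging step would look roughly like this (a sketch where I write total_merge by hand and give its single column the name "tag" for readability; the implementation above leaves it with an integer column name):

import pandas as pd

# Hand-written stand-in for total_merge, with the column renamed to "tag".
total_merge = pd.DataFrame(
    {"tag": ["A", "A", "B", "B"]},
    index=["idA_2", "idA_4", "idB_1", "idB_4"],
)

df = pd.DataFrame(
    [[5, 4, 1], [1, 0, 0], [2, 1, 2], [2, 3, 2]],
    index=["idA_2", "idB_1", "idB_4", "idA_4"],
)

# Inner join on the index: keeps only rows whose tag is known
# and appends the "tag" column.
tagged = pd.concat((df, total_merge), join="inner", axis=1)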
Is there a better way to do this kind of operation?
Upvotes: 5
Views: 4630
Reputation: 8194
I finally found out that the pandas Index object has an __or__ implementation (a set union). Hopefully the following version of associate_tag avoids the superfluous operations:
import pandas as pd
from operator import or_ as union
from itertools import repeat
from functools import reduce

def associate_tag(dfs, tag):
    # Union of the indices; Index.__or__ is a set union, so no values are touched.
    idx = reduce(union, (df.index for df in dfs))
    return pd.DataFrame(list(zip(idx, repeat(tag)))).set_index(0)
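Note that on recent pandas versions the | operator on an Index may no longer act as a set operation, so an explicit (and, as far as I can tell, equivalent) spelling would be:

import pandas as pd
from itertools import repeat
from functools import reduce

def associate_tag(dfs, tag):
    # Explicit set union of the indices; no values are touched.
    idx = reduce(pd.Index.union, (df.index for df in dfs))
    return pd.DataFrame(list(zip(idx, repeat(tag)))).set_index(0)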
Upvotes: 8
Reputation: 4315
Based on your comment, here is an amended solution in two parts.
Part 1, combining your dataframes: depending on your column names, you could just pd.concat your whole list of dataframes once you've made sure the column names line up. So if dfA_1 is:
col1 col2
index
idA_1 2 0
idA_2 1 0
idA_3 0 2
and dfA_2 is:
col1 col2 col3
index
idA_1 3 2 1
idA_3 2 6 2
idA_4 4 0 2
then
final = pd.concat([dfA_1,dfA_2])
final
col1 col2 col3
index
idA_1 2 0 NaN
idA_2 1 0 NaN
idA_3 0 2 NaN
idA_1 3 2 1.0
idA_3 2 6 2.0
idA_4 4 0 2.0
To fill those NaNs with zeros:
final.fillna(0, inplace=True)
Part 2, the tags: once you have that, creating the tags is as easy as defining a map over the index. You can either write a simple function, hardcode a dict, or use a lambda:
final['tag'] = final.index.map(lambda x: x[2])  # third character of the label, e.g. "idA_1" -> "A"
final
col1 col2 col3 tag
index
idA_1 2 0 0.0 A
idA_2 1 0 0.0 A
idA_3 0 2 0.0 A
idA_1 3 2 1.0 A
idA_3 2 6 2.0 A
idA_4 4 0 2.0 A
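If you prefer the "simple function" option over the lambda, a sketch (assuming labels shaped like id<TAG>_<n>; adjust the parsing to your real labels) could be:

import pandas as pd

# The example frames, typed out by hand.
dfA_1 = pd.DataFrame(
    [[2, 0], [1, 0], [0, 2]],
    index=["idA_1", "idA_2", "idA_3"],
    columns=["col1", "col2"],
)
dfA_2 = pd.DataFrame(
    [[3, 2, 1], [2, 6, 2], [4, 0, 2]],
    index=["idA_1", "idA_3", "idA_4"],
    columns=["col1", "col2", "col3"],
)

def tag_from_label(label):
    # "idA_1" -> "A": everything between the "id" prefix and the "_".
    return label[2:label.index("_")]

final = pd.concat([dfA_1, dfA_2]).fillna(0)
final["tag"] = final.index.map(tag_from_label)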
Upvotes: 0