teepee

Reputation: 2714

Merge a copy of one pandas DataFrame into every row of another DataFrame?

I have a scenario where I want to expand a DataFrame by merging a smaller table into every row of it.

In other words, if the larger table has 10 rows and the smaller table has 2 rows, the result would be a 20-row table in which every row of the original table appears twice, with the columns from the smaller table appended.

To accomplish this, I have written a small function that adds a common column onto each table, merges on that column, then drops the column.

import pandas as pd

def merge_expand(big, small):
    # Add a constant column to both frames, merge on it, then drop it.
    # Work on copies so the caller's DataFrames are left unchanged.
    placeholder = "__placeholderstring__"
    big = big.copy()
    small = small.copy()
    big.insert(0, placeholder, 1)
    small.insert(0, placeholder, 1)
    merged = big.merge(small, how='left', on=placeholder)
    return merged.drop(columns=placeholder)

# example
big = pd.DataFrame({'a': [1,2,3], 'b': [4,5,6]})
small = pd.DataFrame({'id': ['aa','bb'], 'val':['a','b']})
merge_expand(big, small)

# output:
   a  b  id val
0  1  4  aa   a
1  1  4  bb   b
2  2  5  aa   a
3  2  5  bb   b
4  3  6  aa   a
5  3  6  bb   b

This does the job, but it seems hacky to me and may not be the most efficient solution, as it has to perform several DataFrame operations. What would be the most efficient way to handle this problem?

Upvotes: 6

Views: 7436

Answers (3)

GuD

Reputation: 121

I believe there is a much shorter way. Given data frames df1 and df2, you could do

df = df1.merge(df2, how='cross')

or

df = df2.merge(df1, how='cross')

You could add a simple if/else to figure out which DataFrame is smaller or larger, but that is beside the point of the merge itself.
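
For instance, applied to the big and small frames from the question (a quick sketch reusing the example data; note that how='cross' requires pandas 1.2 or newer):

import pandas as pd

big = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
small = pd.DataFrame({'id': ['aa', 'bb'], 'val': ['a', 'b']})

# Every row of big is paired with every row of small.
big.merge(small, how='cross')

# output:
   a  b  id val
0  1  4  aa   a
1  1  4  bb   b
2  2  5  aa   a
3  2  5  bb   b
4  3  6  aa   a
5  3  6  bb   b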

Upvotes: 6

rrcal

Reputation: 3752

It looks like you are looking for a cross join / Cartesian product. It can be accomplished with pd.merge if we assign the same constant key to every row of both frames.

big.assign(key=1).merge(small.assign(key=1), how='outer', on='key')

Output

   a  b  key  id val
0  1  4    1  aa   a
1  1  4    1  bb   b
2  2  5    1  aa   a
3  2  5    1  bb   b
4  3  6    1  aa   a
5  3  6    1  bb   b

If you already have a column called 'key', the placeholder column can essentially be called anything:

big['thiswontmatchanything'] = 1
small['thiswontmatchanything'] = 1

big.merge(small, how='outer', on='thiswontmatchanything').drop('thiswontmatchanything', axis=1)

Output

    a   b   id  val
0   1   4   aa  a
1   1   4   bb  b
2   2   5   aa  a
3   2   5   bb  b
4   3   6   aa  a
5   3   6   bb  b
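
If you would rather not modify big and small in place, the same constant-key trick can be wrapped in a small helper (a sketch; the name cross_merge and the '_tmp_key' column name are just illustrative):

def cross_merge(left, right, key='_tmp_key'):
    # Give every row of both frames the same constant key,
    # merge on it, then drop the helper column again.
    return (
        left.assign(**{key: 1})
            .merge(right.assign(**{key: 1}), on=key)
            .drop(columns=key)
    )

cross_merge(big, small)  # same six-row result as above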

Upvotes: 9

Mark Moretto

Reputation: 2348

Possibly less hacky is the following:

Each DataFrame has its rows replicated by the length of the other original DataFrame. The first one is then ordered by the 'a' column, though you could adjust that. Finally, the two DataFrames are concatenated along the column axis (axis=1) to achieve the desired result.

def merge_expand(*args):
    # Repeat each frame's rows by the length of the other, align the
    # indexes, then concatenate column-wise.
    tmp_big = (pd.concat([args[0]] * len(args[1]), ignore_index=True)
                 .sort_values(by=['a'])
                 .reset_index(drop=True))
    tmp_small = pd.concat([args[1]] * len(args[0]), ignore_index=True)
    return pd.concat([tmp_big, tmp_small], axis=1)

Input:

merge_expand(big, small)

Output:

   a  b  id val
0  1  4  aa   a
1  1  4  bb   b
2  2  5  aa   a
3  2  5  bb   b
4  3  6  aa   a
5  3  6  bb   b

EDIT: We can even make it a bit more generic if you want to pass a few arguments:

def merge_expand(*args):
    if len(args) == 2:
        if len(args[0]) > len(args[1]):
            # Sort by the first column of the larger frame instead of hard-coding 'a'.
            df_1 = (pd.concat([args[0]] * len(args[1]), ignore_index=True)
                      .sort_values(by=[args[0].columns[0]])
                      .reset_index(drop=True))
            df_2 = pd.concat([args[1]] * len(args[0]), ignore_index=True)
            return pd.concat([df_1, df_2], axis=1)
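
As a quick check (this version only handles exactly two arguments with the larger frame passed first, and returns None otherwise):

merge_expand(big, small)
# returns the same six-row result as the original version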

Upvotes: 1
