Reputation: 2714
I have a scenario where I want to expand a dataframe by merging a smaller table into every row of the dataframe.
In other words, if the larger table has 10 rows and the smaller table has 2 rows, the result would be a table of 20 rows, where every row of the original table is duplicated and the columns from the smaller table are merged in.
To accomplish this, I have written a small function that adds a common column onto each table, merges on that column, then drops the column.
import pandas as pd

def merge_expand(big, small):
    # add a constant column to both frames, merge on it, then drop it
    placeholder = "__placeholderstring__"
    big.insert(0, placeholder, 1)
    small.insert(0, placeholder, 1)
    merged = big.merge(small, how='left', on=placeholder)
    merged.drop(columns=placeholder, inplace=True)
    return merged
# example
big = pd.DataFrame({'a': [1,2,3], 'b': [4,5,6]})
small = pd.DataFrame({'id': ['aa','bb'], 'val':['a','b']})
merge_expand(big, small)
# output:
a b id val
0 1 4 aa a
1 1 4 bb b
2 2 5 aa a
3 2 5 bb b
4 3 6 aa a
5 3 6 bb b
This does the job, but it seems hacky to me and may not be the most efficient solution, as it performs multiple DataFrame operations. What would be the most efficient way to handle this problem?
Upvotes: 6
Views: 7436
Reputation: 121
I believe there is a much shorter way. Given data frames df1 and df2, you can do a cross merge (available in pandas 1.2+):
df = df1.merge(df2, how='cross')
or
df = df2.merge(df1, how='cross')
You could potentially add a simple if/else to figure out which data frame is smaller or larger, but that is separate from the merge operation itself.
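Applied to the frames from the question, the cross merge reproduces the desired output directly (note that `how='cross'` requires pandas 1.2 or newer):

```python
import pandas as pd

# frames from the question
big = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
small = pd.DataFrame({'id': ['aa', 'bb'], 'val': ['a', 'b']})

# how='cross' pairs every row of big with every row of small: 3 * 2 = 6 rows
result = big.merge(small, how='cross')
print(result)
```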
Upvotes: 6
Reputation: 3752
It looks like you are looking for a cross join (cartesian product). It can be accomplished with pd.merge if we assign the same key to every row of both frames.
big.assign(key=1).merge(small.assign(key=1), how='outer', on='key')
Output
a b key id val
0 1 4 1 aa a
1 1 4 1 bb b
2 2 5 1 aa a
3 2 5 1 bb b
4 3 6 1 aa a
5 3 6 1 bb b
If you already have a column called 'key', the placeholder column can be named essentially anything:
big['thiswontmatchanything'] = 1
small['thiswontmatchanything'] = 1
big.merge(small, how='outer', on='thiswontmatchanything').drop('thiswontmatchanything', axis=1)
Output
a b id val
0 1 4 aa a
1 1 4 bb b
2 2 5 aa a
3 2 5 bb b
4 3 6 aa a
5 3 6 bb b
Upvotes: 9
Reputation: 2348
Possibly less hacky is the following:
Each dataframe's rows are replicated by the length of the other original dataframe. The first one is ordered by the 'a' column, but you could adjust that. Then the two dataframes are concatenated along the column axis (axis=1) to achieve the desired result.
def merge_expand(*args):
    # repeat each frame by the length of the other, then concatenate side by side
    tmp_big = pd.concat([args[0]] * len(args[1]), ignore_index=True).sort_values(by=['a']).reset_index(drop=True)
    tmp_small = pd.concat([args[1]] * len(args[0]), ignore_index=True)
    return pd.concat([tmp_big, tmp_small], axis=1)
Input:
merge_expand(big, small)
Output:
a b id val
0 1 4 aa a
1 1 4 bb b
2 2 5 aa a
3 2 5 bb b
4 3 6 aa a
5 3 6 bb b
EDIT: We can even make it a bit more generic, so it does not hardcode the sort column:
def merge_expand(*args):
    if len(args) == 2:
        if len(args[0]) > len(args[1]):
            # sort by the first column of the larger frame instead of a hardcoded 'a'
            df_1 = pd.concat([args[0]] * len(args[1]), ignore_index=True).sort_values(by=[args[0].columns[0]]).reset_index(drop=True)
            df_2 = pd.concat([args[1]] * len(args[0]), ignore_index=True)
            return pd.concat([df_1, df_2], axis=1)
Upvotes: 1