Cesar
Cesar

Reputation: 617

Merge Dataframes and repeat values

I'm having difficulty coming up with a simple merge logic.

 TimeSeries = A.merge(B, on = ['YEAR','MONTH'], how= 'left')

I have the following dataframe. A continuous month/year column from 2001-2018 (12 months X 17 yrs)

A:

|---------------------|------------------|
|       Month         |         Year     |
|---------------------|------------------|
|          12         |         2001     |
|---------------------|------------------|
|          01         |         2002     |
|---------------------|------------------|
|       ......        |        .....     |
|---------------------|------------------|
|          12         |         2018     |
|---------------------|------------------|

And the other data frame which has a unique ID, month, year and $ amount

B:

 |-----------------|----------------|--------------|--------|
 |       Month     |       Year     |       ID     | Amount |
 |-----------------|----------------|--------------|--------|
 |          05     |       2013     |       2      | 500K   |
 |-----------------|----------------|--------------|--------|
 |          01     |       2002     |       1      | 200K   |
 |-----------------|----------------|--------------|--------|
 |          12     |       2016     |       3      | 800K   |
 |-----------------|----------------|--------------|--------|

How would I be able to append/merge both dataframes so each ID from dataframe B has the entire time series from dataframe A. And it should merge on the correct month and year column.

Desired Output for ID. Each ID should have ~204 rows and if the month and year are same, then display the correct $ amount

 |---------------------|------------------|--------------|-----------|
 |       Month         |         Year     |       ID     |   Amount  |
 |---------------------|------------------|--------------|-----------|
 |          01         |         2001     |       1      |    NaN    |
 |---------------------|------------------|--------------|-----------|
 |          01         |         2002     |       1      |    200K   |
 |---------------------|------------------|--------------|-----------|
 |          12         |         2018     |       1      |    NaN    |
 |---------------------|------------------|--------------|-----------|

Upvotes: 2

Views: 367

Answers (1)

ALollz
ALollz

Reputation: 59519

You want .reindex, that way ID is filled in for each row. There's no need for DataFrameA:

import pandas as pd

names = ['Month', 'Year', 'ID']
idx = pd.MultiIndex.from_product([range(1, 13, 1), range(2001, 2019, 1), B.ID.unique()], 
                                 names=names)

B = B.set_index(names).reindex(idx).reset_index()

Output:

print(B.head())

#   Month  Year  ID Amount
#0      1  2001   2    NaN
#1      1  2001   1    NaN
#2      1  2001   3    NaN
#3      1  2002   2    NaN
#4      1  2002   1   200K

B.groupby('ID').size()
#ID
#1    216
#2    216
#3    216
#dtype: int64

Performant Cartesian Product has ways to optimize the product if it's rather slow

Upvotes: 1

Related Questions