Reputation: 211
I have two dataframes a and b with DatetimeIndexes and want to merge them so that all indices are included, with NaN values where holes occur.
This worked in the past:
df = pd.merge(train_t1s.iloc[:lols], sym_train_t1.iloc[:lols], how='outer', sort=True, left_index=True, right_index=True)
But suddenly it creates a dataframe which is a lot longer than a and b together (e.g. a's length is 2000 and b's length is 2000 too, but df's length will be 316000000).
a and b each have only one column. That column also contains datetime values in the same format (so maybe this could be the cause).
a:
col_a
2019-07-22 18:18:00 2019-07-23 15:54:00
2019-07-22 18:19:00 2019-07-23 15:54:00
2019-07-22 18:20:00 2019-07-23 15:54:00
2019-07-22 18:21:00 2019-07-23 15:54:00
2019-07-22 18:22:00 2019-07-23 15:54:00
2019-07-22 18:23:00 2019-07-23 15:54:00
2019-07-22 18:24:00 2019-07-23 15:54:00
2019-07-22 18:25:00 2019-07-23 15:54:00
2019-07-22 18:26:00 2019-07-23 15:54:00
2019-07-22 18:27:00 2019-07-23 15:54:00
2019-07-22 18:28:00 2019-07-23 15:54:00
2019-07-22 18:29:00 2019-07-23 15:54:00
2019-07-22 18:30:00 2019-07-23 15:54:00
2019-07-22 18:31:00 2019-07-23 15:54:00
2019-07-22 18:32:00 2019-07-23 15:54:00
2019-07-22 18:33:00 2019-07-23 15:54:00
2019-07-22 18:34:00 2019-07-23 15:54:00
2019-07-22 18:35:00 2019-07-23 15:54:00
2019-07-22 18:36:00 2019-07-23 15:54:00
2019-07-22 18:37:00 2019-07-23 15:54:00
2019-07-22 18:38:00 2019-07-23 15:54:00
2019-07-22 18:39:00 2019-07-23 15:54:00
2019-07-22 18:40:00 2019-07-23 15:54:00
2019-07-22 18:41:00 2019-07-23 15:54:00
2019-07-22 18:42:00 2019-07-23 15:54:00
2019-07-22 18:43:00 2019-07-23 15:54:00
2019-07-22 18:44:00 2019-07-23 15:54:00
b:
col_b
2019-07-22 17:40:00 2019-07-23 15:54:00
2019-07-22 17:41:00 2019-07-23 15:54:00
2019-07-22 17:42:00 2019-07-23 15:54:00
2019-07-22 17:43:00 2019-07-23 15:54:00
2019-07-22 17:44:00 2019-07-23 15:54:00
2019-07-22 17:45:00 2019-07-23 15:54:00
2019-07-22 17:46:00 2019-07-23 15:54:00
2019-07-22 17:47:00 2019-07-23 15:54:00
2019-07-22 17:48:00 2019-07-23 15:54:00
2019-07-22 17:49:00 2019-07-23 15:54:00
2019-07-22 17:50:00 2019-07-23 15:54:00
2019-07-22 17:51:00 2019-07-23 15:54:00
2019-07-22 17:52:00 2019-07-23 15:54:00
2019-07-22 17:53:00 2019-07-23 15:54:00
2019-07-22 17:54:00 2019-07-23 15:54:00
2019-07-22 17:55:00 2019-07-23 15:54:00
2019-07-22 17:56:00 2019-07-23 15:54:00
2019-07-22 17:57:00 2019-07-23 15:54:00
2019-07-22 17:58:00 2019-07-23 15:54:00
2019-07-22 17:59:00 2019-07-23 15:54:00
2019-07-22 18:00:00 2019-07-23 15:54:00
2019-07-22 18:01:00 2019-07-23 15:54:00
2019-07-22 18:02:00 2019-07-23 15:54:00
2019-07-22 18:03:00 2019-07-23 15:54:00
2019-07-22 18:04:00 2019-07-23 15:54:00
2019-07-22 18:05:00 2019-07-23 15:54:00
2019-07-22 18:06:00 2019-07-23 15:54:00
Here is a sample of what I think my data looks like:
import pandas as pd
periods_A = 102; periods_B = 57
ix1 = pd.date_range(start="2019-07-21 09:25:00", periods=periods_A, freq="T")
df1 = pd.DataFrame({"A": ["2019-07-21 09:25:00"] * len(ix1)}, index=ix1)
ix2 = pd.date_range(start="2019-07-20 22:25:00", periods=periods_B, freq="T")
df2 = pd.DataFrame({"B": ["2019-07-20 22:25:00"] * len(ix2)}, index=ix2)
result = pd.merge(df1, df2, how='outer', sort=True, left_index=True, right_index=True)
print(len(df1), len(df2), len(result))
Am I missing something here?
Why pd.concat doesn't work:
import pandas as pd
periods_A = 20; periods_B = 20
ix1 = pd.date_range(start="2019-07-21 09:25:00", periods=periods_A, freq="T")
df1 = pd.DataFrame({"A": ["2019-07-21 09:25:00"] * len(ix1)}, index=ix1)
ix2 = pd.date_range(start="2019-07-21 09:23:00", periods=periods_B, freq="T")
df2 = pd.DataFrame({"B": ["2019-07-21 22:29:00"] * len(ix2)}, index=ix2)
result1 = pd.merge(df1, df2, how='outer', sort=True, left_index=True, right_index=True)
print(len(df1), len(df2), len(result1))
print(result1)
result2 = pd.concat([df1, df2])
print(len(df1), len(df2), len(result2))
print(result2.sort_index())
print(result1.equals(result2.sort_index()))
# output:
# pd.merge:
20 20 22
A B
2019-07-21 09:23:00 NaN 2019-07-21 22:29:00
2019-07-21 09:24:00 NaN 2019-07-21 22:29:00
2019-07-21 09:25:00 2019-07-21 09:25:00 2019-07-21 22:29:00
2019-07-21 09:26:00 2019-07-21 09:25:00 2019-07-21 22:29:00
2019-07-21 09:27:00 2019-07-21 09:25:00 2019-07-21 22:29:00
2019-07-21 09:28:00 2019-07-21 09:25:00 2019-07-21 22:29:00
2019-07-21 09:29:00 2019-07-21 09:25:00 2019-07-21 22:29:00
2019-07-21 09:30:00 2019-07-21 09:25:00 2019-07-21 22:29:00
2019-07-21 09:31:00 2019-07-21 09:25:00 2019-07-21 22:29:00
2019-07-21 09:32:00 2019-07-21 09:25:00 2019-07-21 22:29:00
2019-07-21 09:33:00 2019-07-21 09:25:00 2019-07-21 22:29:00
2019-07-21 09:34:00 2019-07-21 09:25:00 2019-07-21 22:29:00
2019-07-21 09:35:00 2019-07-21 09:25:00 2019-07-21 22:29:00
2019-07-21 09:36:00 2019-07-21 09:25:00 2019-07-21 22:29:00
2019-07-21 09:37:00 2019-07-21 09:25:00 2019-07-21 22:29:00
2019-07-21 09:38:00 2019-07-21 09:25:00 2019-07-21 22:29:00
2019-07-21 09:39:00 2019-07-21 09:25:00 2019-07-21 22:29:00
2019-07-21 09:40:00 2019-07-21 09:25:00 2019-07-21 22:29:00
2019-07-21 09:41:00 2019-07-21 09:25:00 2019-07-21 22:29:00
2019-07-21 09:42:00 2019-07-21 09:25:00 2019-07-21 22:29:00
2019-07-21 09:43:00 2019-07-21 09:25:00 NaN
2019-07-21 09:44:00 2019-07-21 09:25:00 NaN
# pd.concat:
20 20 40
A B
2019-07-21 09:23:00 NaN 2019-07-21 22:29:00
2019-07-21 09:24:00 NaN 2019-07-21 22:29:00
2019-07-21 09:25:00 2019-07-21 09:25:00 NaN
2019-07-21 09:25:00 NaN 2019-07-21 22:29:00
2019-07-21 09:26:00 2019-07-21 09:25:00 NaN
2019-07-21 09:26:00 NaN 2019-07-21 22:29:00
2019-07-21 09:27:00 2019-07-21 09:25:00 NaN
2019-07-21 09:27:00 NaN 2019-07-21 22:29:00
2019-07-21 09:28:00 2019-07-21 09:25:00 NaN
2019-07-21 09:28:00 NaN 2019-07-21 22:29:00
2019-07-21 09:29:00 2019-07-21 09:25:00 NaN
2019-07-21 09:29:00 NaN 2019-07-21 22:29:00
2019-07-21 09:30:00 NaN 2019-07-21 22:29:00
2019-07-21 09:30:00 2019-07-21 09:25:00 NaN
2019-07-21 09:31:00 NaN 2019-07-21 22:29:00
2019-07-21 09:31:00 2019-07-21 09:25:00 NaN
2019-07-21 09:32:00 NaN 2019-07-21 22:29:00
2019-07-21 09:32:00 2019-07-21 09:25:00 NaN
2019-07-21 09:33:00 NaN 2019-07-21 22:29:00
2019-07-21 09:33:00 2019-07-21 09:25:00 NaN
2019-07-21 09:34:00 NaN 2019-07-21 22:29:00
2019-07-21 09:34:00 2019-07-21 09:25:00 NaN
2019-07-21 09:35:00 NaN 2019-07-21 22:29:00
2019-07-21 09:35:00 2019-07-21 09:25:00 NaN
2019-07-21 09:36:00 2019-07-21 09:25:00 NaN
2019-07-21 09:36:00 NaN 2019-07-21 22:29:00
2019-07-21 09:37:00 NaN 2019-07-21 22:29:00
2019-07-21 09:37:00 2019-07-21 09:25:00 NaN
2019-07-21 09:38:00 NaN 2019-07-21 22:29:00
2019-07-21 09:38:00 2019-07-21 09:25:00 NaN
2019-07-21 09:39:00 NaN 2019-07-21 22:29:00
2019-07-21 09:39:00 2019-07-21 09:25:00 NaN
2019-07-21 09:40:00 2019-07-21 09:25:00 NaN
2019-07-21 09:40:00 NaN 2019-07-21 22:29:00
2019-07-21 09:41:00 2019-07-21 09:25:00 NaN
2019-07-21 09:41:00 NaN 2019-07-21 22:29:00
2019-07-21 09:42:00 NaN 2019-07-21 22:29:00
2019-07-21 09:42:00 2019-07-21 09:25:00 NaN
2019-07-21 09:43:00 2019-07-21 09:25:00 NaN
2019-07-21 09:44:00 2019-07-21 09:25:00 NaN
False
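(Side note on the comparison above: pd.concat with the default axis=0 stacks the two frames vertically, which is why it yields 40 rows with duplicated timestamps. Aligning on the index instead, by concatenating along axis=1, reproduces the merge result. A sketch using the same toy frames:)

```python
import pandas as pd

ix1 = pd.date_range(start="2019-07-21 09:25:00", periods=20, freq="min")
df1 = pd.DataFrame({"A": ["2019-07-21 09:25:00"] * len(ix1)}, index=ix1)
ix2 = pd.date_range(start="2019-07-21 09:23:00", periods=20, freq="min")
df2 = pd.DataFrame({"B": ["2019-07-21 22:29:00"] * len(ix2)}, index=ix2)

merged = pd.merge(df1, df2, how='outer', sort=True, left_index=True, right_index=True)
# axis=1 aligns the frames on the union of their indices instead of stacking rows
aligned = pd.concat([df1, df2], axis=1).sort_index()
print(len(aligned), merged.equals(aligned))
```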
Upvotes: 2
Views: 1627
Reputation: 62463
import pandas as pd
# read files, convert to datetime and set as index
df1 = pd.read_csv('a_.csv', parse_dates=[0, 1], index_col=[0])
df2 = pd.read_csv('b_.csv', parse_dates=[0, 1], index_col=[0])
print(len(df1), len(df2))
[out]:
3519 3519
# check the number of unique values in the index
df1.index.nunique()
df2.index.nunique()
[out]:
2956
2956
That means 563 index values on each side are duplicated. In an outer merge, every occurrence of a duplicated key on the left is matched against every occurrence on the right (a cartesian product per key, on the order of 563**2 extra rows), which explains the blown-up result.
# see only the duplicated index and associated column data
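The blow-up can be reproduced in miniature: merging two frames that each repeat one index value n times yields n * n rows for that key. A hypothetical sketch:

```python
import pandas as pd

# toy frames: one timestamp repeated 3 times on each side
ix = pd.DatetimeIndex(["2019-07-22 18:18:00"] * 3)
df1 = pd.DataFrame({"a_col": range(3)}, index=ix)
df2 = pd.DataFrame({"b_col": range(3)}, index=ix)

result = pd.merge(df1, df2, how='outer', left_index=True, right_index=True)
print(len(result))  # 3 occurrences x 3 occurrences = 9 rows
```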
dup1 = df1[df1.index.duplicated()]
dup2 = df2[df2.index.duplicated()]
# check if the column values are unique
dup1.nunique()
dup2.nunique()
[out]:
b_col 1
dtype: int64
a_col 1
dtype: int64
Use pandas.Index.duplicated and Boolean Indexing to keep only the first occurrence of each index value:
df1 = df1[~df1.index.duplicated(keep='first')]
df2 = df2[~df2.index.duplicated(keep='first')]
# join the dataframes on their (now unique) indices
result = df1.join(df2, how='outer', sort=True)
print(len(df1), len(df2), len(result))
print(result.shape)
[out]:
2956 2956 2956
(2956, 2)
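(A defensive option, assuming a pandas version that supports it, is merge's validate argument: it raises a MergeError as soon as a key is duplicated, so this kind of blow-up fails loudly instead of silently producing millions of rows. A sketch with hypothetical toy frames:)

```python
import pandas as pd

ix = pd.DatetimeIndex(["2019-07-22 18:18:00"] * 2)  # duplicated timestamp
df1 = pd.DataFrame({"a_col": [1, 2]}, index=ix)
df2 = pd.DataFrame({"b_col": [3, 4]}, index=ix)

try:
    pd.merge(df1, df2, how='outer', left_index=True, right_index=True,
             validate='one_to_one')
except pd.errors.MergeError:
    print("duplicate keys detected")
```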
Upvotes: 4