Reputation: 211
I have two dataframes a and b with DatetimeIndexes and want to merge them so that all indices are included, with NaN values where holes occur.
This worked in the past:
df = pd.merge(train_t1s.iloc[:lols], sym_train_t1.iloc[:lols], how='outer', sort=True, left_index=True, right_index=True)
But suddenly it creates a dataframe which is a lot longer than a and b together (e.g. a's length is 2000 and b's length is 2000 too, but df's length will be 316000000).
a and b each have only one column. That column also contains datetime values in the same format (so maybe this could be the cause).
a:
col_a
2019-07-22 18:18:00 2019-07-23 15:54:00
2019-07-22 18:19:00 2019-07-23 15:54:00
2019-07-22 18:20:00 2019-07-23 15:54:00
2019-07-22 18:21:00 2019-07-23 15:54:00
2019-07-22 18:22:00 2019-07-23 15:54:00
2019-07-22 18:23:00 2019-07-23 15:54:00
2019-07-22 18:24:00 2019-07-23 15:54:00
2019-07-22 18:25:00 2019-07-23 15:54:00
2019-07-22 18:26:00 2019-07-23 15:54:00
2019-07-22 18:27:00 2019-07-23 15:54:00
2019-07-22 18:28:00 2019-07-23 15:54:00
2019-07-22 18:29:00 2019-07-23 15:54:00
2019-07-22 18:30:00 2019-07-23 15:54:00
2019-07-22 18:31:00 2019-07-23 15:54:00
2019-07-22 18:32:00 2019-07-23 15:54:00
2019-07-22 18:33:00 2019-07-23 15:54:00
2019-07-22 18:34:00 2019-07-23 15:54:00
2019-07-22 18:35:00 2019-07-23 15:54:00
2019-07-22 18:36:00 2019-07-23 15:54:00
2019-07-22 18:37:00 2019-07-23 15:54:00
2019-07-22 18:38:00 2019-07-23 15:54:00
2019-07-22 18:39:00 2019-07-23 15:54:00
2019-07-22 18:40:00 2019-07-23 15:54:00
2019-07-22 18:41:00 2019-07-23 15:54:00
2019-07-22 18:42:00 2019-07-23 15:54:00
2019-07-22 18:43:00 2019-07-23 15:54:00
2019-07-22 18:44:00 2019-07-23 15:54:00
b:
col_b
2019-07-22 17:40:00 2019-07-23 15:54:00
2019-07-22 17:41:00 2019-07-23 15:54:00
2019-07-22 17:42:00 2019-07-23 15:54:00
2019-07-22 17:43:00 2019-07-23 15:54:00
2019-07-22 17:44:00 2019-07-23 15:54:00
2019-07-22 17:45:00 2019-07-23 15:54:00
2019-07-22 17:46:00 2019-07-23 15:54:00
2019-07-22 17:47:00 2019-07-23 15:54:00
2019-07-22 17:48:00 2019-07-23 15:54:00
2019-07-22 17:49:00 2019-07-23 15:54:00
2019-07-22 17:50:00 2019-07-23 15:54:00
2019-07-22 17:51:00 2019-07-23 15:54:00
2019-07-22 17:52:00 2019-07-23 15:54:00
2019-07-22 17:53:00 2019-07-23 15:54:00
2019-07-22 17:54:00 2019-07-23 15:54:00
2019-07-22 17:55:00 2019-07-23 15:54:00
2019-07-22 17:56:00 2019-07-23 15:54:00
2019-07-22 17:57:00 2019-07-23 15:54:00
2019-07-22 17:58:00 2019-07-23 15:54:00
2019-07-22 17:59:00 2019-07-23 15:54:00
2019-07-22 18:00:00 2019-07-23 15:54:00
2019-07-22 18:01:00 2019-07-23 15:54:00
2019-07-22 18:02:00 2019-07-23 15:54:00
2019-07-22 18:03:00 2019-07-23 15:54:00
2019-07-22 18:04:00 2019-07-23 15:54:00
2019-07-22 18:05:00 2019-07-23 15:54:00
2019-07-22 18:06:00 2019-07-23 15:54:00
Here is a sample of what I think my data looks like:
import pandas as pd
periods_A = 102; periods_B = 57
ix1 = pd.date_range(start="2019-07-21 09:25:00", periods=periods_A, freq="T")
df1 = pd.DataFrame({"A": ["2019-07-21 09:25:00"] * len(ix1)}, index=ix1)
ix2 = pd.date_range(start="2019-07-20 22:25:00", periods=periods_B, freq="T")
df2 = pd.DataFrame({"B": ["2019-07-20 22:25:00"] * len(ix2)}, index=ix2)
result = pd.merge(df1, df2, how='outer', sort=True, left_index=True, right_index=True)
print(len(df1), len(df2), len(result))
Am I missing something here?
Why pd.concat doesn't work:
import pandas as pd
periods_A = 20; periods_B = 20
ix1 = pd.date_range(start="2019-07-21 09:25:00", periods=periods_A, freq="T")
df1 = pd.DataFrame({"A": ["2019-07-21 09:25:00"] * len(ix1)}, index=ix1)
ix2 = pd.date_range(start="2019-07-21 09:23:00", periods=periods_B, freq="T")
df2 = pd.DataFrame({"B": ["2019-07-21 22:29:00"] * len(ix2)}, index=ix2)
result1 = pd.merge(df1, df2, how='outer', sort=True, left_index=True, right_index=True)
print(len(df1), len(df2), len(result1))
print(result1)
result2 = pd.concat([df1, df2])
print(len(df1), len(df2), len(result2))
print(result2.sort_index())
print(result1.equals(result2.sort_index()))
# output:
# pd.merge:
20 20 22
A B
2019-07-21 09:23:00 NaN 2019-07-21 22:29:00
2019-07-21 09:24:00 NaN 2019-07-21 22:29:00
2019-07-21 09:25:00 2019-07-21 09:25:00 2019-07-21 22:29:00
2019-07-21 09:26:00 2019-07-21 09:25:00 2019-07-21 22:29:00
2019-07-21 09:27:00 2019-07-21 09:25:00 2019-07-21 22:29:00
2019-07-21 09:28:00 2019-07-21 09:25:00 2019-07-21 22:29:00
2019-07-21 09:29:00 2019-07-21 09:25:00 2019-07-21 22:29:00
2019-07-21 09:30:00 2019-07-21 09:25:00 2019-07-21 22:29:00
2019-07-21 09:31:00 2019-07-21 09:25:00 2019-07-21 22:29:00
2019-07-21 09:32:00 2019-07-21 09:25:00 2019-07-21 22:29:00
2019-07-21 09:33:00 2019-07-21 09:25:00 2019-07-21 22:29:00
2019-07-21 09:34:00 2019-07-21 09:25:00 2019-07-21 22:29:00
2019-07-21 09:35:00 2019-07-21 09:25:00 2019-07-21 22:29:00
2019-07-21 09:36:00 2019-07-21 09:25:00 2019-07-21 22:29:00
2019-07-21 09:37:00 2019-07-21 09:25:00 2019-07-21 22:29:00
2019-07-21 09:38:00 2019-07-21 09:25:00 2019-07-21 22:29:00
2019-07-21 09:39:00 2019-07-21 09:25:00 2019-07-21 22:29:00
2019-07-21 09:40:00 2019-07-21 09:25:00 2019-07-21 22:29:00
2019-07-21 09:41:00 2019-07-21 09:25:00 2019-07-21 22:29:00
2019-07-21 09:42:00 2019-07-21 09:25:00 2019-07-21 22:29:00
2019-07-21 09:43:00 2019-07-21 09:25:00 NaN
2019-07-21 09:44:00 2019-07-21 09:25:00 NaN
# pd.concat:
20 20 40
A B
2019-07-21 09:23:00 NaN 2019-07-21 22:29:00
2019-07-21 09:24:00 NaN 2019-07-21 22:29:00
2019-07-21 09:25:00 2019-07-21 09:25:00 NaN
2019-07-21 09:25:00 NaN 2019-07-21 22:29:00
2019-07-21 09:26:00 2019-07-21 09:25:00 NaN
2019-07-21 09:26:00 NaN 2019-07-21 22:29:00
2019-07-21 09:27:00 2019-07-21 09:25:00 NaN
2019-07-21 09:27:00 NaN 2019-07-21 22:29:00
2019-07-21 09:28:00 2019-07-21 09:25:00 NaN
2019-07-21 09:28:00 NaN 2019-07-21 22:29:00
2019-07-21 09:29:00 2019-07-21 09:25:00 NaN
2019-07-21 09:29:00 NaN 2019-07-21 22:29:00
2019-07-21 09:30:00 NaN 2019-07-21 22:29:00
2019-07-21 09:30:00 2019-07-21 09:25:00 NaN
2019-07-21 09:31:00 NaN 2019-07-21 22:29:00
2019-07-21 09:31:00 2019-07-21 09:25:00 NaN
2019-07-21 09:32:00 NaN 2019-07-21 22:29:00
2019-07-21 09:32:00 2019-07-21 09:25:00 NaN
2019-07-21 09:33:00 NaN 2019-07-21 22:29:00
2019-07-21 09:33:00 2019-07-21 09:25:00 NaN
2019-07-21 09:34:00 NaN 2019-07-21 22:29:00
2019-07-21 09:34:00 2019-07-21 09:25:00 NaN
2019-07-21 09:35:00 NaN 2019-07-21 22:29:00
2019-07-21 09:35:00 2019-07-21 09:25:00 NaN
2019-07-21 09:36:00 2019-07-21 09:25:00 NaN
2019-07-21 09:36:00 NaN 2019-07-21 22:29:00
2019-07-21 09:37:00 NaN 2019-07-21 22:29:00
2019-07-21 09:37:00 2019-07-21 09:25:00 NaN
2019-07-21 09:38:00 NaN 2019-07-21 22:29:00
2019-07-21 09:38:00 2019-07-21 09:25:00 NaN
2019-07-21 09:39:00 NaN 2019-07-21 22:29:00
2019-07-21 09:39:00 2019-07-21 09:25:00 NaN
2019-07-21 09:40:00 2019-07-21 09:25:00 NaN
2019-07-21 09:40:00 NaN 2019-07-21 22:29:00
2019-07-21 09:41:00 2019-07-21 09:25:00 NaN
2019-07-21 09:41:00 NaN 2019-07-21 22:29:00
2019-07-21 09:42:00 NaN 2019-07-21 22:29:00
2019-07-21 09:42:00 2019-07-21 09:25:00 NaN
2019-07-21 09:43:00 2019-07-21 09:25:00 NaN
2019-07-21 09:44:00 2019-07-21 09:25:00 NaN
False
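(Side note on the comparison above: pd.concat with the default axis=0 stacks the two frames vertically, which is why it yields 40 rows with duplicated timestamps. Aligning on the index instead, by concatenating along axis=1, reproduces the merge result. A sketch using the same toy frames:)

```python
import pandas as pd

ix1 = pd.date_range(start="2019-07-21 09:25:00", periods=20, freq="min")
df1 = pd.DataFrame({"A": ["2019-07-21 09:25:00"] * len(ix1)}, index=ix1)
ix2 = pd.date_range(start="2019-07-21 09:23:00", periods=20, freq="min")
df2 = pd.DataFrame({"B": ["2019-07-21 22:29:00"] * len(ix2)}, index=ix2)

merged = pd.merge(df1, df2, how='outer', sort=True, left_index=True, right_index=True)
# axis=1 aligns the frames on the union of their indices instead of stacking rows
aligned = pd.concat([df1, df2], axis=1).sort_index()
print(len(aligned), merged.equals(aligned))
```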
Upvotes: 2
Views: 1627
Reputation: 62463
import pandas as pd
# read files, convert to datetime and set as index
df1 = pd.read_csv('a_.csv', parse_dates=[0, 1], index_col=[0])
df2 = pd.read_csv('b_.csv', parse_dates=[0, 1], index_col=[0])
print(len(df1), len(df2))
[out]:
3519 3519
# check the number of unique values in the index
df1.index.nunique()
df2.index.nunique()
[out]:
2956
2956
That means 563 index values on each side are duplicated. In an outer merge, every occurrence of a duplicated key on the left is matched against every occurrence on the right (a cartesian product per key, on the order of 563**2 extra rows), which explains the blown-up result.
# see only the duplicated index and associated column data
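The blow-up can be reproduced in miniature: merging two frames that each repeat one index value n times yields n * n rows for that key. A hypothetical sketch:

```python
import pandas as pd

# toy frames: one timestamp repeated 3 times on each side
ix = pd.DatetimeIndex(["2019-07-22 18:18:00"] * 3)
df1 = pd.DataFrame({"a_col": range(3)}, index=ix)
df2 = pd.DataFrame({"b_col": range(3)}, index=ix)

result = pd.merge(df1, df2, how='outer', left_index=True, right_index=True)
print(len(result))  # 3 occurrences x 3 occurrences = 9 rows
```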
dup1 = df1[df1.index.duplicated()]
dup2 = df2[df2.index.duplicated()]
# check if the column values are unique
dup1.nunique()
dup2.nunique()
[out]:
b_col 1
dtype: int64
a_col 1
dtype: int64
Use pandas.Index.duplicated and Boolean Indexing to keep only the first occurrence of each index value:
df1 = df1[~df1.index.duplicated(keep='first')]
df2 = df2[~df2.index.duplicated(keep='first')]
# join the dataframes on their (now unique) indices
result = df1.join(df2, how='outer', sort=True)
print(len(df1), len(df2), len(result))
print(result.shape)
[out]:
2956 2956 2956
(2956, 2)
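(A defensive option, assuming a pandas version that supports it, is merge's validate argument: it raises a MergeError as soon as a key is duplicated, so this kind of blow-up fails loudly instead of silently producing millions of rows. A sketch with hypothetical toy frames:)

```python
import pandas as pd

ix = pd.DatetimeIndex(["2019-07-22 18:18:00"] * 2)  # duplicated timestamp
df1 = pd.DataFrame({"a_col": [1, 2]}, index=ix)
df2 = pd.DataFrame({"b_col": [3, 4]}, index=ix)

try:
    pd.merge(df1, df2, how='outer', left_index=True, right_index=True,
             validate='one_to_one')
except pd.errors.MergeError:
    print("duplicate keys detected")
```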
Upvotes: 4