ojunk
ojunk

Reputation: 899

Pandas: Inconsistent concatenation

I have two pandas DataFrames with a date-time stamp as an index called "datetime" and some floats as a column called "metric1" or "metric2". When I try to concatenate them I get the following error:

ValueError: cannot reindex from a duplicate axis

After reading for many hours I cannot find a solution that fixes my problem (e.g. What does `ValueError: cannot reindex from a duplicate axis` mean? "ValueError: cannot reindex from a duplicate axis" ).

I then spent a long time trying to recreate the problem but couldn't whithout my specific data but my data is two big to post on here.

In the end it seemed like certain parts of the dataframes were incompatible but other parts were fine. After comparing many different slices of indexes I finally found incompatible sets. Can someone please help me understand why I can't concatenate these.

Basically I would like it so that all possible time-stamps are in the index and there is a column for "metric1" and "metric2". If there is no data for a column at a given timestamp then we just have NaN or something. This works fine normally with pd.concat but it is not working in this case. To recreate please use:

CSV Files:

test1.csv

timestamp,metric1
2018-03-21 15:46:36,3.5555559999999997
2018-03-21 15:47:36,5.345001
2018-03-21 15:48:36,5.719998

test2.csv

timestamp,metric2
2018-03-28 05:49:59,3.28
2018-03-28 05:50:59,3.45
2018-03-28 05:51:59,3.258332
2018-03-28 05:52:59,3.068333
2018-03-28 05:53:59,2.9733330000000002
2018-03-28 05:54:59,3.0650009999999996
2018-03-28 05:55:59,3.109999
2018-03-28 05:56:59,3.3683330000000002
2018-03-28 05:57:59,3.1516669999999998
2018-03-28 05:58:59,3.051666
2018-03-28 05:59:59,3.3083339999999994
2018-03-28 06:01:01,3.328333
2018-03-28 06:01:01,3.1
2018-03-28 06:02:00,3.305
2018-03-28 06:03:00,3.29
2018-03-28 06:04:00,3.2183330000000003
2018-03-28 06:05:00,3.176666
2018-03-28 06:06:00,3.353333
2018-03-28 06:07:00,3.3233330000000003
2018-03-28 06:08:00,3.393332
2018-03-28 06:09:00,3.053334
2018-03-28 06:10:00,3.268333
2018-03-28 06:11:00,3.239999
2018-03-28 06:12:00,3.223332
2018-03-28 06:13:00,3.119999

test4.csv

timestamp,metric2
2018-03-21 00:00:00,10.665
2018-03-21 00:01:00,10.285
2018-03-21 00:02:00,10.12834

Note: test2.csv and test4.csv come from EXACTLY the same data set.

Now let's load the CSV files:

tt1 = pd.read_csv('test1.csv', index_col=0)
tt1.index = pd.to_datetime(tt1.index)
tt2 = pd.read_csv('test2.csv', index_col=0)
tt2.index = pd.to_datetime(tt2.index)
tt4 = pd.read_csv('test4.csv', index_col=0)
tt4.index = pd.to_datetime(tt4.index)

Now let's test concatenating them:

Test with no error

tt3 = pd.concat([tt1, tt4], axis = 1)

Test with error

tt3 = pd.concat([tt1, tt2], axis = 1)
ValueError: cannot reindex from a duplicate axis

Upvotes: 1

Views: 626

Answers (3)

BallpointBen
BallpointBen

Reputation: 13750

Don't merge, join. join joins on the indices.

tt1.join(tt2, how='outer')

Upvotes: 1

ricko72
ricko72

Reputation: 24

I have solved your question.

Have a look this solution :)

import pandas as pd

tt1 = pd.read_csv('test1.csv', index_col=0)
tt1.index = pd.to_datetime(tt1.index)
tt2 = pd.read_csv('test2.csv', index_col=0)
tt2.index = pd.to_datetime(tt2.index)
tt4 = pd.read_csv('test4.csv', index_col=0)
tt4.index = pd.to_datetime(tt4.index)

tt3 = pd.concat([tt1, tt4], axis=1)
tt4 = tt3.reset_index().merge(tt2.reset_index(), how='outer')
tt4 = tt4.set_index('timestamp')
print(tt4)

Hope it will make sense

Upvotes: -1

harpan
harpan

Reputation: 8631

You have duplicate index in tt2. That causes the error. Correct way to get the output:

tt1.reset_index().merge(tt2.reset_index(), how='outer')

Upvotes: 2

Related Questions