Reputation: 1419
I am trying to concat the following dataframes:
df1
price side timestamp
timestamp
2016-01-04 00:01:15.631331072 0.7286 2 1451865675631331
2016-01-04 00:01:15.631399936 0.7286 2 1451865675631400
2016-01-04 00:01:15.631860992 0.7286 2 1451865675631861
2016-01-04 00:01:15.631866112 0.7286 2 1451865675631866
and:
df2
bid bid_size offer offer_size
timestamp
2016-01-04 00:00:31.331441920 0.7284 4000000 0.7285 1000000
2016-01-04 00:00:53.631324928 0.7284 4000000 0.7290 4000000
2016-01-04 00:01:03.131234048 0.7284 5000000 0.7286 4000000
2016-01-04 00:01:12.131444992 0.7285 1000000 0.7286 4000000
2016-01-04 00:01:15.631364096 0.7285 4000000 0.7290 4000000
With
data = pd.concat([df1,df2], axis=1)
But I get the following error:
InvalidIndexError Traceback (most recent call last)
<ipython-input-38-2e88458f01d7> in <module>()
----> 1 data = pd.concat([df1,df2], axis=1)
2 data = data.fillna(method='pad')
3 data = data.fillna(method='bfill')
4 data['timestamp'] = data.index.values#converting to datetime
5 data['timestamp'] = pd.to_datetime(data['timestamp'])#converting to datetime
/usr/local/lib/python2.7/site-packages/pandas/tools/merge.pyc in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, copy)
810 keys=keys, levels=levels, names=names,
811 verify_integrity=verify_integrity,
--> 812 copy=copy)
813 return op.get_result()
814
/usr/local/lib/python2.7/site-packages/pandas/tools/merge.pyc in __init__(self, objs, axis, join, join_axes, keys, levels, names, ignore_index, verify_integrity, copy)
947 self.copy = copy
948
--> 949 self.new_axes = self._get_new_axes()
950
951 def get_result(self):
/usr/local/lib/python2.7/site-packages/pandas/tools/merge.pyc in _get_new_axes(self)
1013 if i == self.axis:
1014 continue
-> 1015 new_axes[i] = self._get_comb_axis(i)
1016 else:
1017 if len(self.join_axes) != ndim - 1:
/usr/local/lib/python2.7/site-packages/pandas/tools/merge.pyc in _get_comb_axis(self, i)
1039 raise TypeError("Cannot concatenate list of %s" % types)
1040
-> 1041 return _get_combined_index(all_indexes, intersect=self.intersect)
1042
1043 def _get_concat_axis(self):
/usr/local/lib/python2.7/site-packages/pandas/core/index.pyc in _get_combined_index(indexes, intersect)
6120 index = index.intersection(other)
6121 return index
-> 6122 union = _union_indexes(indexes)
6123 return _ensure_index(union)
6124
/usr/local/lib/python2.7/site-packages/pandas/core/index.pyc in _union_indexes(indexes)
6149
6150 if hasattr(result, 'union_many'):
-> 6151 return result.union_many(indexes[1:])
6152 else:
6153 for other in indexes[1:]:
/usr/local/lib/python2.7/site-packages/pandas/tseries/index.pyc in union_many(self, others)
959 else:
960 tz = this.tz
--> 961 this = Index.union(this, other)
962 if isinstance(this, DatetimeIndex):
963 this.tz = tz
/usr/local/lib/python2.7/site-packages/pandas/core/index.pyc in union(self, other)
1553 result.extend([x for x in other._values if x not in value_set])
1554 else:
-> 1555 indexer = self.get_indexer(other)
1556 indexer, = (indexer == -1).nonzero()
1557
/usr/local/lib/python2.7/site-packages/pandas/core/index.pyc in get_indexer(self, target, method, limit, tolerance)
1890
1891 if not self.is_unique:
-> 1892 raise InvalidIndexError('Reindexing only valid with uniquely'
1893 ' valued Index objects')
1894
InvalidIndexError: Reindexing only valid with uniquely valued Index objects
I have removed additional columns, duplicates, and NAs where there could be a conflict, but I simply do not know what's wrong.
Upvotes: 127
Views: 392540
Reputation: 55
df1 = pd.DataFrame.from_dict({'A':[1,2,3], 'B':[110,120,230]})
df2 = pd.DataFrame.from_dict({'B':[11,12,23], 'C':[10,11,12]})
df2.rename(columns={'C':'B'}, inplace=True)
pd.concat([df1,df2])
InvalidIndexError: Reindexing only valid with uniquely valued Index objects
The above code reproduces the error when concatenating DataFrames row-wise. The error occurs because df2 ends up with two columns named 'B':
A B
0 1 110
1 2 120
2 3 230
B B
0 11 10
1 12 11
2 23 12
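A quick way to confirm this condition before calling pd.concat is to inspect the column index directly (a minimal sketch rebuilding the frame above; Index.is_unique and Index.duplicated are standard pandas attributes):

```python
import pandas as pd

# rebuild the frame from the example above: two columns both labelled 'B'
df2 = pd.DataFrame([[11, 10], [12, 11], [23, 12]], columns=['B', 'B'])

# False when any column label repeats
print(df2.columns.is_unique)                  # False

# flags the repeated labels themselves
print(df2.columns[df2.columns.duplicated()])  # Index(['B'], dtype='object')
```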
Upvotes: 0
Reputation: 1
The same happened to me, but then I noticed that my datetime indexes ended on different last dates. When I trimmed both DataFrames to the same time window, pd.concat() worked fine.
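A sketch of the trimming idea (the frames and dates here are invented for illustration): restrict both DataFrames to the window covered by both indexes before concatenating.

```python
import pandas as pd

df1 = pd.DataFrame({'x': [1, 2, 3]},
                   index=pd.to_datetime(['2016-01-01', '2016-01-02', '2016-01-03']))
df2 = pd.DataFrame({'y': [4, 5, 6]},
                   index=pd.to_datetime(['2016-01-02', '2016-01-03', '2016-01-04']))

# keep only the window covered by both indexes
start = max(df1.index.min(), df2.index.min())
end = min(df1.index.max(), df2.index.max())
out = pd.concat([df1.loc[start:end], df2.loc[start:end]], axis=1)
print(out)
```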
Upvotes: 0
Reputation: 2822
When pandas performs a horizontal concat (axis=1), it matches rows by index and joins them side by side, so the row with index 1 from df1 is matched with the row with index 1 from df2. Therefore, reset the index in one of them, or in both if both have problematic indexes.
pd.concat([df1.reset_index(drop=True), df2], axis=1)
or
pd.concat([df1.reset_index(drop=True), df2.reset_index(drop=True)], axis=1)
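As a minimal sketch of why this works (the frames here are invented): a repeated index value makes the axis=1 alignment ambiguous, and resetting the index makes the match purely positional.

```python
import pandas as pd

# df1 has a duplicated index value, so axis=1 alignment is ambiguous
df1 = pd.DataFrame({'price': [1.0, 2.0, 3.0]}, index=[0, 0, 1])
df2 = pd.DataFrame({'bid': [4.0, 5.0, 6.0]}, index=[0, 1, 2])

try:
    pd.concat([df1, df2], axis=1)
except pd.errors.InvalidIndexError:
    print('duplicate index -> InvalidIndexError')

# after resetting both indexes, rows are matched purely by position
out = pd.concat([df1.reset_index(drop=True), df2.reset_index(drop=True)], axis=1)
print(out)
```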
Upvotes: 4
Reputation: 21
The solution below applies if your concat uses axis=0, which means you want to append rows, not columns.
Reason: one or both of your DataFrames might have duplicate columns. For example, df1's columns could be A, B, C, C while df2's columns are B, D. In this case df1 has a duplicate column C, and for this reason you might get this error. Drop one of the C columns in df1 and the issue should be resolved. You can test for duplicate labels with:
df1.columns.is_unique
Upvotes: 0
Reputation: 23
The problem for me was duplicate column labels, just as many others here mentioned. To keep only the first of each duplicated column, I used:
df=df.T[~df.T.index.duplicated(keep='first')].T
Upvotes: 2
Reputation: 935
This is because you have duplicated columns. Before concatenating, drop the duplicated columns in each DataFrame as follows:
df = df.loc[:,~df.columns.duplicated()].reset_index(drop=True)
Upvotes: 5
Reputation: 2699
The answers here helped, but in some cases concat worked fine for me even when duplicate columns were present, while in other cases it raised the InvalidIndexError.
It turned out that concat works if the duplicate columns appear in the same order in both DataFrames, but raises the error if their order differs.
Example where it works fine:
df = pd.DataFrame({'a': [1, 2, 3], 'b': [5, 6, 7], 'c': [9, 10, 11]})
df1 = pd.DataFrame({'a': [12], 'b': [13], 'c': [14]})
df.rename(columns={
'c': 'b'
}, inplace=True)
df1.rename(columns={
'c': 'b'
}, inplace=True)
print(pd.concat([df, df1]))
Output:
a b b
0 1 5 9
1 2 6 10
2 3 7 11
0 12 13 14
Example where it doesn't work:
df = pd.DataFrame({'b': [1, 2, 3], 'a': [5, 6, 7], 'c': [9, 10, 11]})
df1 = pd.DataFrame({'a': [12], 'b': [13], 'c': [14]})
df.rename(columns={
'c': 'b'
}, inplace=True)
df1.rename(columns={
'c': 'b'
}, inplace=True)
print(pd.concat([df, df1]))
Output:
pandas.errors.InvalidIndexError: Reindexing only valid with uniquely
valued Index objects
Upvotes: 1
Reputation: 21
This happened to me when I was trying to concat two DataFrames that had duplicated column names!
Let's say that I want to remove the first duplicated column:
duplicated_column = 'column'
df_tmp = df[duplicated_column].T
df_tmp = df_tmp.iloc[1: , :]
df = df.drop([duplicated_column], axis=1)
df = pd.concat([df, df_tmp.T], axis=1)
Upvotes: 2
Reputation: 1632
This post comes up top when you search for the error, but the answers are not complete, so let me add mine. There is another reason this error can happen: if you have duplicate columns in your data frames, you will not be able to concatenate them and this error is raised. In fact, even in the original question the name timestamp appears twice in df1 (as the index name and as a column). So it is worth checking that len(df.columns) == len(set(df.columns)) holds for all the data frames you are trying to concatenate.
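That check can be wrapped in a small loop over all the frames before calling concat (a sketch; for hashable labels, df.columns.is_unique is equivalent to the len(...) == len(set(...)) comparison):

```python
import pandas as pd

frames = [
    pd.DataFrame([[1, 2]], columns=['timestamp', 'price']),
    pd.DataFrame([[3, 4]], columns=['timestamp', 'timestamp']),  # duplicate label
]

# positions of the frames whose column labels are not unique
bad = [i for i, df in enumerate(frames) if not df.columns.is_unique]
print(bad)  # [1]
```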
Upvotes: 32
Reputation: 1314
Another reason for this issue might be that df1 and df2 share index values: for example, both DataFrames might contain the same index value idx1.
To check if this is the issue, see if the following prints a non-empty list:
print([org_name for org_name in cum_df.index if org_name in df_from_2002.index])
My suggested solution is to rename the indices (so df1 keeps idx1 and you change idx1 to idx2 in df2), concatenate (df1 = pd.concat([df1, df2])), and then combine the two rows (in case you need their sum) with:
df1.loc[idx1] = df1.loc[[idx1, idx2]].sum()
and then remove idx2:
df1.drop([idx2], inplace=True)
Upvotes: 3
Reputation: 159
This happens also when you have duplicates in the columns names.
Upvotes: 4
Reputation: 2865
In my case the problem was because I had duplicated column names.
Upvotes: 145
Reputation: 1328
As a complement to Nicholas Morley's answer, when you find that even this does not work:
df = df.reset_index(drop=True)
you should check whether the columns are unique. When they are not, even resetting the index does not help; the duplicated columns must be removed first.
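Putting the two steps together in one sketch (assuming you want to keep the first of each duplicated column; the frame here is invented):

```python
import pandas as pd

# a frame with both a duplicated column label and a non-unique index
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'b'], index=[7, 7])

df = df.loc[:, ~df.columns.duplicated()]  # keep the first of each duplicated column
df = df.reset_index(drop=True)            # then replace the non-unique index
print(df)
```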
Upvotes: 8
Reputation: 314
Best solution, from this page:
https://pandas.pydata.org/pandas-docs/version/0.20/merging.html
df = pd.concat([df1, df2], axis=1, join_axes=[df1.index])
Note that join_axes was deprecated in pandas 0.25 and removed in 1.0; in modern pandas use df1.join(df2) or reindex instead.
Upvotes: -1
Reputation: 4240
You can mitigate this error without having to change your data or remove duplicates. Just create a new index with DataFrame.reset_index:
df = df.reset_index()
The old index is kept as a column in your dataframe, but if you don't need it you can do:
df = df.reset_index(drop=True)
Some prefer:
df.reset_index(inplace=True, drop=True)
Upvotes: 117
Reputation: 879401
pd.concat requires that the indices be unique. To remove rows with duplicate indices, use
df = df.loc[~df.index.duplicated(keep='first')]
import pandas as pd
from pandas import Timestamp
df1 = pd.DataFrame(
{'price': [0.7286, 0.7286, 0.7286, 0.7286],
'side': [2, 2, 2, 2],
'timestamp': [1451865675631331, 1451865675631400,
1451865675631861, 1451865675631866]},
index=pd.DatetimeIndex(['2000-1-1', '2000-1-1', '2001-1-1', '2002-1-1']))
df2 = pd.DataFrame(
{'bid': [0.7284, 0.7284, 0.7284, 0.7285, 0.7285],
'bid_size': [4000000, 4000000, 5000000, 1000000, 4000000],
'offer': [0.7285, 0.729, 0.7286, 0.7286, 0.729],
'offer_size': [1000000, 4000000, 4000000, 4000000, 4000000]},
index=pd.DatetimeIndex(['2000-1-1', '2001-1-1', '2002-1-1', '2003-1-1', '2004-1-1']))
df1 = df1.loc[~df1.index.duplicated(keep='first')]
# price side timestamp
# 2000-01-01 0.7286 2 1451865675631331
# 2001-01-01 0.7286 2 1451865675631861
# 2002-01-01 0.7286 2 1451865675631866
df2 = df2.loc[~df2.index.duplicated(keep='first')]
# bid bid_size offer offer_size
# 2000-01-01 0.7284 4000000 0.7285 1000000
# 2001-01-01 0.7284 4000000 0.7290 4000000
# 2002-01-01 0.7284 5000000 0.7286 4000000
# 2003-01-01 0.7285 1000000 0.7286 4000000
# 2004-01-01 0.7285 4000000 0.7290 4000000
result = pd.concat([df1, df2], axis=0)
print(result)
bid bid_size offer offer_size price side timestamp
2000-01-01 NaN NaN NaN NaN 0.7286 2 1.451866e+15
2001-01-01 NaN NaN NaN NaN 0.7286 2 1.451866e+15
2002-01-01 NaN NaN NaN NaN 0.7286 2 1.451866e+15
2000-01-01 0.7284 4000000 0.7285 1000000 NaN NaN NaN
2001-01-01 0.7284 4000000 0.7290 4000000 NaN NaN NaN
2002-01-01 0.7284 5000000 0.7286 4000000 NaN NaN NaN
2003-01-01 0.7285 1000000 0.7286 4000000 NaN NaN NaN
2004-01-01 0.7285 4000000 0.7290 4000000 NaN NaN NaN
Note there is also DataFrame.join, which can join DataFrames based on their indices and handle non-unique indices based on the how parameter. Rows with a duplicated index are not removed.
In [94]: df1.join(df2)
Out[94]:
price side timestamp bid bid_size offer \
2000-01-01 0.7286 2 1451865675631331 0.7284 4000000 0.7285
2000-01-01 0.7286 2 1451865675631400 0.7284 4000000 0.7285
2001-01-01 0.7286 2 1451865675631861 0.7284 4000000 0.7290
2002-01-01 0.7286 2 1451865675631866 0.7284 5000000 0.7286
offer_size
2000-01-01 1000000
2000-01-01 1000000
2001-01-01 4000000
2002-01-01 4000000
In [95]: df1.join(df2, how='outer')
Out[95]:
price side timestamp bid bid_size offer offer_size
2000-01-01 0.7286 2 1.451866e+15 0.7284 4000000 0.7285 1000000
2000-01-01 0.7286 2 1.451866e+15 0.7284 4000000 0.7285 1000000
2001-01-01 0.7286 2 1.451866e+15 0.7284 4000000 0.7290 4000000
2002-01-01 0.7286 2 1.451866e+15 0.7284 5000000 0.7286 4000000
2003-01-01 NaN NaN NaN 0.7285 1000000 0.7286 4000000
2004-01-01 NaN NaN NaN 0.7285 4000000 0.7290 4000000
Upvotes: 70