Reputation: 5791
I'm concatenating two dataframes column-wise, so that the columns of one are placed next to the columns of the other. First, I applied a transformation to the initial dataframe:
scaler = MinMaxScaler()
real_data = pd.DataFrame(scaler.fit_transform(df[real_columns]), columns = real_columns)
Then I built the dummy columns and concatenated:
categorial_data = pd.get_dummies(df[categor_columns], prefix_sep= '__')
train = pd.concat([real_data, categorial_data], axis=1, ignore_index=True)
I don't know why, but the number of rows increased:
print(df.shape, real_data.shape, categorial_data.shape, train.shape)
(1700645, 23) (1700645, 16) (1700645, 130) (1703915, 146)
What happened, and how can I fix it?
As you can see, the number of columns in train equals the sum of the column counts of real_data and categorial_data, so only the row count is wrong.
Upvotes: 30
Views: 25144
Reputation: 392
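Some operations on a dataframe (dropping or filtering rows, for example) change its dimensions but leave the index labels of the remaining rows untouched. pd.concat with axis=1 aligns rows by index label, so dataframes whose indices differ produce extra rows. Hence we need to call reset_index on each dataframe before concatenating.
For concatenation you can do this: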
result_df = pd.concat([first_df.reset_index(drop=True), second_df.reset_index(drop=True)], axis=1)
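A minimal sketch of why the row count grows and how reset_index(drop=True) fixes it. The frames `left` and `right` are hypothetical toy data, not the asker's dataframes; the only assumption is that one frame has a non-contiguous index and the other a default one.

```python
import pandas as pd

# Hypothetical toy frames: `left` kept the labels of a filtered frame,
# `right` has a fresh default index (0, 1, 2).
left = pd.DataFrame({"a": [10, 20, 30]}, index=[0, 2, 5])
right = pd.DataFrame({"b": [1, 2, 3]})

# axis=1 aligns rows on index labels and keeps the union of both indices
# (labels 0, 1, 2, 5), so the result has 4 rows, with NaN where a label is missing.
misaligned = pd.concat([left, right], axis=1)
print(misaligned.shape)

# Dropping the old labels makes both frames positional (0, 1, 2),
# so rows are paired by position and the row count stays at 3.
aligned = pd.concat(
    [left.reset_index(drop=True), right.reset_index(drop=True)], axis=1
)
print(aligned.shape)
```

Note that with axis=1, ignore_index=True renumbers the column labels, not the row index, so it does not prevent this misalignment.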
Upvotes: 16
Reputation: 11
This happens when the indices of the dataframes being concatenated differ. After preprocessing, the resulting dataframe loses the original index and gets a new default one (0, 1, 2, ...). Setting the index of each dataframe back to the original before concatenating works, i.e. df_concatenated.index = df_original.index.
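As a sketch of this approach, the original index can also be passed directly when the preprocessed dataframe is built. The data below is hypothetical, and the manual min-max scaling is only a stand-in for scaler.fit_transform, which likewise returns a plain array without an index.

```python
import pandas as pd

# Hypothetical source frame whose index is not 0..n-1 (e.g. after filtering).
df = pd.DataFrame({"x": [1.0, 3.0, 5.0], "cat": ["u", "v", "u"]}, index=[0, 2, 5])

# Stand-in for scaler.fit_transform(df[["x"]]): a NumPy array with no index.
values = df[["x"]].to_numpy()
scaled = (values - values.min(axis=0)) / (values.max(axis=0) - values.min(axis=0))

# Reattach the original index so both frames carry the same row labels.
real_data = pd.DataFrame(scaled, columns=["x"], index=df.index)

# get_dummies keeps the index of its input, so it already matches df.index.
categorial_data = pd.get_dummies(df[["cat"]], columns=["cat"], prefix_sep="__")

train = pd.concat([real_data, categorial_data], axis=1)
print(train.shape)
```

Leaving out ignore_index=True here also keeps the column names, since with axis=1 that option replaces the column labels with 0, 1, 2, ...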
Upvotes: 1
Reputation: 404
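The problem is that after several operations on a dataframe, it still carries the index labels of the rows it started with, and these may no longer match the other dataframe's index. Calling df.reset_index(drop=True) on each dataframe before concatenating will solve your problem.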
Upvotes: 38
Reputation: 5791
I solved the problem by using hstack, which joins the underlying arrays by position and ignores the indices:
train = pd.DataFrame(np.hstack([real_data,categorial_data]))
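One caveat: a dataframe built from the stacked array has integer column labels. A small sketch, using hypothetical toy frames, of keeping the original column names with the same hstack approach:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames with deliberately different indices.
real_data = pd.DataFrame({"x": [0.0, 0.5, 1.0]})
categorial_data = pd.DataFrame(
    {"cat__u": [1, 0, 1], "cat__v": [0, 1, 0]}, index=[0, 2, 5]
)

# hstack works on the raw arrays, so rows are paired purely by position.
stacked = np.hstack([real_data.to_numpy(), categorial_data.to_numpy()])

# Pass the combined column names explicitly, otherwise they become 0, 1, 2.
columns = list(real_data.columns) + list(categorial_data.columns)
train = pd.DataFrame(stacked, columns=columns)
print(train.shape)
```

Because the arrays are merged into one, all columns end up with a common dtype (float here), which may or may not matter for later steps.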
Upvotes: 4