Rocketq

Reputation: 5791

Pandas Concat increases number of rows

I'm concatenating two dataframes side by side, so the columns of one dataframe sit next to the columns of the other. But first I applied a transformation to the initial dataframe:

scaler = MinMaxScaler() 
real_data = pd.DataFrame(scaler.fit_transform(df[real_columns]), columns = real_columns)

And then concatenate:

categorial_data  = pd.get_dummies(df[categor_columns], prefix_sep= '__')
train = pd.concat([real_data, categorial_data], axis=1, ignore_index=True)

I don't know why, but the number of rows increased:

print(df.shape, real_data.shape, categorial_data.shape, train.shape)
(1700645, 23) (1700645, 16) (1700645, 130) (1703915, 146)

What happened, and how can I fix the problem?

As you can see, the number of columns in train equals the sum of the columns of real_data and categorial_data, but the number of rows does not match either input.
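For anyone hitting this later: the extra rows come from index alignment. Rebuilding a dataframe from the scaler's NumPy output gives it a fresh 0..n-1 RangeIndex, while get_dummies keeps the original dataframe's index; pd.concat(axis=1) then aligns on the index labels, not on row position. A minimal sketch with made-up data (the index gaps stand in for rows dropped earlier in preprocessing):

```python
import pandas as pd

# Hypothetical minimal reproduction: df has a non-default index,
# e.g. because rows were filtered out earlier.
df = pd.DataFrame({'a': [1.0, 2.0, 3.0]}, index=[0, 2, 4])

# Rebuilding one frame from a NumPy array gives it a fresh RangeIndex...
left = pd.DataFrame(df['a'].to_numpy(), columns=['a'])   # index 0, 1, 2
# ...while get_dummies keeps the original index.
right = pd.get_dummies(df['a'].astype(str))              # index 0, 2, 4

# concat(axis=1) aligns on index labels, so the union {0, 1, 2, 4}
# produces more rows than either input, padded with NaN.
out = pd.concat([left, right], axis=1)
print(out.shape)  # (4, 4) instead of (3, 4)
```

If df's index were a clean 0..n-1 RangeIndex, the labels would coincide and the row count would not change.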

Upvotes: 30

Views: 25144

Answers (4)

Lucky Suman

Reputation: 392

When you perform operations on a dataframe, its contents may change while the original index labels are kept, and pd.concat aligns on those labels rather than on row position. Hence we need to call reset_index on each dataframe before concatenating.

For concatenation you can do like this:

result_df = pd.concat([first_df.reset_index(drop=True), second_df.reset_index(drop=True)], axis=1)

Upvotes: 16

SACHIN KUMAR

Reputation: 11

This happens when the indices of the dataframes being concatenated differ. After preprocessing, a dataframe can end up with a different index than the original. Setting the index of each piece back to the original before concatenating works, i.e. df_preprocessed.index = df_original.index.

Upvotes: 1

saket ram

Reputation: 404

The problem is that when you perform several operations on a dataframe, the original index labels persist even after rows are added or removed, and pd.concat aligns on those labels. Using df.reset_index(drop=True) on each dataframe will solve your problem (drop=True avoids keeping the old index as an extra column).
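A short sketch of this fix with toy data (the frames and index values are hypothetical, standing in for real_data and categorial_data from the question):

```python
import pandas as pd

# Hypothetical frames with mismatched indices after preprocessing.
real_data = pd.DataFrame({'x': [0.1, 0.2, 0.3]}, index=[5, 7, 9])
categorial_data = pd.DataFrame({'c__a': [1, 0, 1]})  # default index 0, 1, 2

# Give both frames a fresh 0..n-1 index so pd.concat(axis=1)
# pairs rows by position instead of by stale labels.
train = pd.concat(
    [real_data.reset_index(drop=True),
     categorial_data.reset_index(drop=True)],
    axis=1,
)
print(train.shape)  # (3, 2) -- row counts match, no NaN padding
```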

Upvotes: 38

Rocketq

Reputation: 5791

I solved the problem by using np.hstack, which stacks the underlying NumPy arrays and ignores the indices entirely:

train = pd.DataFrame(np.hstack([real_data,categorial_data]))
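One caveat with this workaround: np.hstack drops both the index and the column labels. A sketch with hypothetical toy frames, restoring the column names explicitly:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for real_data and categorial_data.
real_data = pd.DataFrame({'x': [0.1, 0.2]}, index=[3, 8])
categorial_data = pd.DataFrame({'c__a': [1, 0]}, index=[3, 8])

# np.hstack works on the raw arrays, so mismatched indices cannot
# create extra rows -- but the labels are lost, so rebuild them.
train = pd.DataFrame(
    np.hstack([real_data, categorial_data]),
    columns=list(real_data.columns) + list(categorial_data.columns),
)
print(train.shape)  # (2, 2)
```

Note that hstack also coerces everything to a single dtype (here float64), which may matter for downstream code.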

Upvotes: 4
