user1363251

Reputation: 421

Pandas: concat function removes prior sorting of dataframes

Consider two dataframes, socio_demo ([198 rows x 15 columns]) and UPDRS ([198 rows x 70 columns]). First, sort each of them:

socio_demo_sorted = socio_demo.sort_values(['NUMERO_CENTRE_1','NUMERO_INCLUSION_1'])
UPDRS_sorted = UPDRS.sort_values(['NUMERO_CENTRE_2','NUMERO_INCLUSION_2'])

UPDRS_sorted['NUMERO_CENTRE_2'] gives

Out[22]: 
3     1
9     1
13    1
18    1
24    1
     ..
6     6
16    6
20    6
25    6
34    6
Name: NUMERO_CENTRE_2, Length: 198, dtype: int64

Now let's concatenate the two sorted datasets:

frames = [socio_demo_sorted, UPDRS_sorted]
full_data = pd.concat(frames, axis=1)

which gives the expected [198 rows x 85 columns] shape. However, doing

full_data['NUMERO_CENTRE_2']

returns the original (non-sorted) UPDRS data:

0      3
1      4
2      2
3      1
4      5
      ..
193    1
194    1
195    1
196    1
197    1
Name: NUMERO_CENTRE_2, Length: 198, dtype: int64

I don't understand why the effect of the ".sort_values" function is lost here.

Upvotes: 2

Views: 715

Answers (2)

SeaBean

Reputation: 23217

Sorting reorders the rows but keeps each row's original index label, so after sort_values the labels are simply shuffled. pd.concat with axis=1 aligns rows on those index labels, not on their position, so the two sorted dataframes were matched up by their original labels and, as your output shows, came back in the original unsorted order.
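
To see the alignment at work, here is a minimal sketch on toy data. The frames `left` and `right` and their values are made up for illustration and are not taken from your data. The row order of the concatenated result may differ between pandas versions, so the sketch only checks which rows get paired:

```python
import pandas as pd

# Hypothetical stand-ins for the two real dataframes.
left = pd.DataFrame({"a": [30, 10, 20]})
right = pd.DataFrame({"b": [1, 3, 2]})

left_sorted = left.sort_values("a")
right_sorted = right.sort_values("b")

# Sorting moves the rows but keeps their original labels.
print(list(left_sorted.index))   # [1, 2, 0]
print(list(right_sorted.index))  # [0, 2, 1]

# Column-wise concat pairs rows that share a label, not rows that
# share a position, so label 0 still pairs a=30 with b=1.
combined = pd.concat([left_sorted, right_sorted], axis=1)
print(combined.loc[0].tolist())

# Once ordered by label, the result matches the unsorted inputs.
print(combined.sort_index()["b"].tolist())
```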

You can fix this either by resetting the index of each sorted dataframe with .reset_index(drop=True), or by passing ignore_index=True to sort_values directly.

Either:

socio_demo_sorted = socio_demo.sort_values(['NUMERO_CENTRE_1','NUMERO_INCLUSION_1']).reset_index(drop=True)
UPDRS_sorted = UPDRS.sort_values(['NUMERO_CENTRE_2','NUMERO_INCLUSION_2']).reset_index(drop=True)

or:

socio_demo_sorted = socio_demo.sort_values(['NUMERO_CENTRE_1','NUMERO_INCLUSION_1'], ignore_index=True)
UPDRS_sorted = UPDRS.sort_values(['NUMERO_CENTRE_2','NUMERO_INCLUSION_2'], ignore_index=True)

Then concatenate as in your code:

frames = [socio_demo_sorted, UPDRS_sorted]
full_data = pd.concat(frames, axis=1)
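
A small sketch of both variants on toy data, assuming made-up frames `left` and `right` (hypothetical names and values). Once both frames carry the same fresh 0..n-1 labels, the column-wise concat keeps the sorted order:

```python
import pandas as pd

# Hypothetical stand-ins for the two real dataframes.
left = pd.DataFrame({"a": [30, 10, 20]})
right = pd.DataFrame({"b": [1, 3, 2]})

# Variant 1: sort, then drop the old labels.
left_sorted = left.sort_values("a").reset_index(drop=True)

# Variant 2: let sort_values relabel the rows 0..n-1 itself.
right_sorted = right.sort_values("b", ignore_index=True)

combined = pd.concat([left_sorted, right_sorted], axis=1)
print(combined["a"].tolist())  # [10, 20, 30]
print(combined["b"].tolist())  # [1, 2, 3]
```

Note that this pairs rows purely by position, so it only makes sense if both sorted dataframes list the same subjects in the same order.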

Upvotes: 2

BENY

Reputation: 323266

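In your case, note that ignore_index=True in pd.concat only relabels the axis being concatenated along (the columns when axis=1), so the rows are still matched by index label. Reset the row index of each frame before concatenating instead:

out = pd.concat([df.reset_index(drop=True) for df in frames], axis=1)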
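
As a quick check of what ignore_index does in pd.concat, here is a small sketch on made-up frames (`a` and `b` are hypothetical). It suggests the flag relabels only the concatenation axis and leaves row alignment along the other axis untouched:

```python
import pandas as pd

# Hypothetical frames whose row labels are in different orders.
a = pd.DataFrame({"x": [10, 20]}, index=[1, 0])
b = pd.DataFrame({"y": [1, 2]}, index=[0, 1])

# Default axis=0: the frames are stacked and rows relabelled 0..n-1.
stacked = pd.concat([a, b], ignore_index=True)
print(stacked.shape)  # (4, 2)

# axis=1: the columns are relabelled, but rows are still matched by label.
side = pd.concat([a, b], axis=1, ignore_index=True)
print(list(side.columns))  # [0, 1]
print(side.loc[0].tolist())  # label 0 pairs x=20 with y=1
```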

Upvotes: 1
