Reputation: 2608
I have a large pandas dataframe, with time series data and a rather large multiindex. Said index contains various information about the time series, such as for example location, datatype and so on.
Now I want to add a new row to the index, with an integer (or float, doesnt really matter), containing a distance to a certain point. Following that, I want to sort the dataframe by this distance.
I am not sure how to add a new index level, and how to assign new values by hand. Also, can pandas even sort columns after an random number in one of its index levels?
Example
(code from here )
header=pd.MultiIndex.from_product([['location1','location2'],['S1','S2','S3']],names=['loc','S'])
df = pd.DataFrame(np.random.randn(5, 6), index=['a','b','c','d','e'], columns = header)
Looks like this:
loc location1 location2
S S1 S2 S3 S1 S2 S3
a 1.530590 0.536364 1.295848 0.422256 -1.853786 1.334981
b 0.275857 -0.848685 -1.212584 -0.464235 -0.855600 0.680985
c -1.209607 0.265359 -0.695233 0.643896 1.315216 -0.751027
d -1.591613 -0.178605 0.878567 0.647389 -0.454313 -1.972509
e 1.098193 -0.766810 0.087173 0.714301 -0.886545 -0.826163
What I want to to, is in a first step, add some distances to each column, like location1 S1 add dist 200
, location1 S2 add dist 760
and so on, resulting in this:
loc location1 location2
S S1 S2 S3 S1 S2 S3
dist 200 760 10 1000 340 70
a 1.530590 0.536364 1.295848 0.422256 -1.853786 1.334981
b 0.275857 -0.848685 -1.212584 -0.464235 -0.855600 0.680985
c -1.209607 0.265359 -0.695233 0.643896 1.315216 -0.751027
d -1.591613 -0.178605 0.878567 0.647389 -0.454313 -1.972509
e 1.098193 -0.766810 0.087173 0.714301 -0.886545 -0.826163
And then do something like df.sortlevel('dist')
, resulting in
loc location1 location2 location1 location2 location1 location2
S S3 S3 S1 S2 S2 S1
dist 10 70 200 340 760 1000
a 1.295848 1.334981 1.530590 -1.853786 0.536364 0.422256
b -1.212584 0.680985 0.275857 -0.855600 -0.848685 -0.464235
…
So that have the whole thing sorted by distance, for things like plt.matshow(df.corr())
.
Can pandas even sort a df after a random index with an integer? Because I have another dataframe, that already has an integer in its multindex, and here some_otherdf.sortlevel('HZB')
results in TypeError: can only sort by level with a hierarchical index
Edit:
As of now, there is two answers, both of which work perfectly fine for my test case. I think @Pedro M Duarte's answer might be the more correct one, given that it uses the multiindex as intended. However, for my real data, it would require either a lot of reworking, or a lot of typing, for a 7 levels deep multiindex and 50 data series, which is very error prone. @Nader Hisham ignored my request to stay in my multiindex, but it requires just the quick, easy and easy to check writing of a simple row of numbers (saving me a lot of time), that I can then remove after sorting. for other folks with a similar question, it might be different.
Upvotes: 0
Views: 1408
Reputation: 5414
In [35]:
df.loc['dist' , : ] = [200,760,10,1000,340,70]
df
Out[35]:
loc location1 location2
S S1 S2 S3 S1 S2 S3
a 0.348766 -0.326088 -0.891929 -0.704856 -1.514304 0.611692
b -0.546026 -0.111232 -1.022104 -1.246002 0.328385 0.576465
c -0.743512 -0.362791 -0.617021 -0.859157 -0.300368 0.292980
d 0.090178 1.369648 0.171753 -0.411466 0.478654 1.814878
e -0.380414 -1.568492 -0.432858 1.034861 -0.633563 1.403627
dist 200.000000 760.000000 10.000000 1000.000000 340.000000 70.000000
In [36]:
order = np.argsort(df.loc['dist' , :]).values
order
Out[36]:
array([2, 5, 0, 4, 1, 3], dtype=int64)
In [37]:
df.iloc[: , order]
Out[37]:
loc location1 location2 location1 location2 location1 location2
S S3 S3 S1 S2 S2 S1
a -0.891929 0.611692 0.348766 -1.514304 -0.326088 -0.704856
b -1.022104 0.576465 -0.546026 0.328385 -0.111232 -1.246002
c -0.617021 0.292980 -0.743512 -0.300368 -0.362791 -0.859157
d 0.171753 1.814878 0.090178 0.478654 1.369648 -0.411466
e -0.432858 1.403627 -0.380414 -0.633563 -1.568492 1.034861
dist 10.000000 70.000000 200.000000 340.000000 760.000000 1000.000000
if you want to make your dist
index as the first index you can do the following
Upvotes: 1
Reputation: 28093
In[1]:
import pandas as pd
import numpy as np
header=pd.MultiIndex.from_product(
[['location1','location2'],['S1','S2','S3']],
names=['loc','S'])
df = pd.DataFrame(np.random.randn(5, 6),
index=['a','b','c','d','e'], columns = header)
print(df)
Out[1]:
loc location1 location2
S S1 S2 S3 S1 S2 S3
a 0.503357 -0.461202 -1.412865 0.866237 1.290292 0.635869
b -0.904658 -1.190422 -0.198654 -0.916884 -1.070291 -1.918091
c -1.448068 -0.121475 -0.838693 0.047861 -0.131904 1.154370
d 1.758752 -0.094962 -2.035204 -0.399195 -0.756726 1.609393
e 0.421521 1.134518 -0.809148 -0.543523 -1.161328 1.261901
In[2]:
distances = {
('location1','S1'): 200,
('location1','S2'): 760,
('location1','S3'): 10,
('location2','S1'): 1000,
('location2','S2'): 340,
('location2','S3'): 70,
}
index = df.columns
df.columns = pd.MultiIndex.from_tuples(
[(key[0], key[1], distances[key],) for key in index.get_values()],
names=[index.get_level_values(0).name,
index.get_level_values(1).name,
'dist']
)
print(df)
Out[2]:
loc location1 location2
S S1 S2 S3 S1 S2 S3
dist 200 760 10 1000 340 70
a 0.503357 -0.461202 -1.412865 0.866237 1.290292 0.635869
b -0.904658 -1.190422 -0.198654 -0.916884 -1.070291 -1.918091
c -1.448068 -0.121475 -0.838693 0.047861 -0.131904 1.154370
d 1.758752 -0.094962 -2.035204 -0.399195 -0.756726 1.609393
e 0.421521 1.134518 -0.809148 -0.543523 -1.161328 1.261901
In[3]:
result = df.sortlevel(level=2, axis=1)
print(result)
Out[3]:
loc location1 location2 location1 location2 location1 location2
S S3 S3 S1 S2 S2 S1
dist 10 70 200 340 760 1000
a -1.412865 0.635869 0.503357 1.290292 -0.461202 0.866237
b -0.198654 -1.918091 -0.904658 -1.070291 -1.190422 -0.916884
c -0.838693 1.154370 -1.448068 -0.131904 -0.121475 0.047861
d -2.035204 1.609393 1.758752 -0.756726 -0.094962 -0.399195
e -0.809148 1.261901 0.421521 -1.161328 1.134518 -0.543523
Upvotes: 2