JC_CL
JC_CL

Reputation: 2608

Adding a new index row to an existing dataframe and sorting by it

I have a large pandas dataframe, with time series data and a rather large multiindex. Said index contains various information about the time series, such as for example location, datatype and so on.

Now I want to add a new row to the index, with an integer (or float, doesnt really matter), containing a distance to a certain point. Following that, I want to sort the dataframe by this distance.

I am not sure how to add a new index level, and how to assign new values by hand. Also, can pandas even sort columns after an random number in one of its index levels?

Example

(code from here )

header=pd.MultiIndex.from_product([['location1','location2'],['S1','S2','S3']],names=['loc','S'])
df = pd.DataFrame(np.random.randn(5, 6), index=['a','b','c','d','e'], columns = header)

Looks like this:

loc  location1                      location2                    
S           S1        S2        S3         S1        S2        S3
a     1.530590  0.536364  1.295848   0.422256 -1.853786  1.334981
b     0.275857 -0.848685 -1.212584  -0.464235 -0.855600  0.680985
c    -1.209607  0.265359 -0.695233   0.643896  1.315216 -0.751027
d    -1.591613 -0.178605  0.878567   0.647389 -0.454313 -1.972509
e     1.098193 -0.766810  0.087173   0.714301 -0.886545 -0.826163

What I want to to, is in a first step, add some distances to each column, like location1 S1 add dist 200, location1 S2 add dist 760 and so on, resulting in this:

loc  location1                      location2                    
S           S1        S2        S3         S1        S2        S3
dist       200       760        10       1000       340        70
a     1.530590  0.536364  1.295848   0.422256 -1.853786  1.334981
b     0.275857 -0.848685 -1.212584  -0.464235 -0.855600  0.680985
c    -1.209607  0.265359 -0.695233   0.643896  1.315216 -0.751027
d    -1.591613 -0.178605  0.878567   0.647389 -0.454313 -1.972509
e     1.098193 -0.766810  0.087173   0.714301 -0.886545 -0.826163

And then do something like df.sortlevel('dist'), resulting in

loc location1 location2 location1 location2 location1 location2
S          S3        S3        S1        S2        S2        S1
dist       10        70       200       340       760      1000
a    1.295848  1.334981  1.530590 -1.853786  0.536364  0.422256
b   -1.212584  0.680985  0.275857 -0.855600 -0.848685 -0.464235
…

So that have the whole thing sorted by distance, for things like plt.matshow(df.corr()).

Can pandas even sort a df after a random index with an integer? Because I have another dataframe, that already has an integer in its multindex, and here some_otherdf.sortlevel('HZB') results in TypeError: can only sort by level with a hierarchical index

Edit:

As of now, there is two answers, both of which work perfectly fine for my test case. I think @Pedro M Duarte's answer might be the more correct one, given that it uses the multiindex as intended. However, for my real data, it would require either a lot of reworking, or a lot of typing, for a 7 levels deep multiindex and 50 data series, which is very error prone. @Nader Hisham ignored my request to stay in my multiindex, but it requires just the quick, easy and easy to check writing of a simple row of numbers (saving me a lot of time), that I can then remove after sorting. for other folks with a similar question, it might be different.

Upvotes: 0

Views: 1408

Answers (2)

Nader Hisham
Nader Hisham

Reputation: 5414

In [35]:
df.loc['dist' , : ] = [200,760,10,1000,340,70]
df
Out[35]:
loc                location1                 location2
S     S1              S2            S3            S1          S2    S3
a     0.348766  -0.326088   -0.891929   -0.704856   -1.514304   0.611692
b    -0.546026  -0.111232   -1.022104   -1.246002   0.328385    0.576465
c    -0.743512  -0.362791   -0.617021   -0.859157   -0.300368   0.292980
d     0.090178  1.369648    0.171753    -0.411466   0.478654    1.814878
e    -0.380414  -1.568492   -0.432858   1.034861    -0.633563   1.403627
dist 200.000000 760.000000  10.000000   1000.000000 340.000000  70.000000


In [36]:
order = np.argsort(df.loc['dist' , :]).values
order
Out[36]:
array([2, 5, 0, 4, 1, 3], dtype=int64)

In [37]:

df.iloc[: , order]
Out[37]:
loc    location1    location2   location1   location2   location1   location2
S            S3      S3            S1         S2          S2           S1
a     -0.891929    0.611692     0.348766    -1.514304   -0.326088   -0.704856
b     -1.022104    0.576465    -0.546026    0.328385    -0.111232   -1.246002
c     -0.617021    0.292980    -0.743512    -0.300368   -0.362791   -0.859157
d     0.171753     1.814878     0.090178    0.478654    1.369648    -0.411466
e     -0.432858    1.403627     -0.380414   -0.633563   -1.568492   1.034861
dist  10.000000    70.000000    200.000000  340.000000  760.000000  1000.000000

if you want to make your dist index as the first index you can do the following

Upvotes: 1

Pedro M Duarte
Pedro M Duarte

Reputation: 28093

In[1]:
import pandas as pd
import numpy as np

header=pd.MultiIndex.from_product(
    [['location1','location2'],['S1','S2','S3']],
    names=['loc','S'])

df = pd.DataFrame(np.random.randn(5, 6), 
                  index=['a','b','c','d','e'], columns = header)

print(df)

Out[1]:
    loc location1                     location2                    
    S          S1        S2        S3        S1        S2        S3
    a    0.503357 -0.461202 -1.412865  0.866237  1.290292  0.635869
    b   -0.904658 -1.190422 -0.198654 -0.916884 -1.070291 -1.918091
    c   -1.448068 -0.121475 -0.838693  0.047861 -0.131904  1.154370
    d    1.758752 -0.094962 -2.035204 -0.399195 -0.756726  1.609393
    e    0.421521  1.134518 -0.809148 -0.543523 -1.161328  1.261901



In[2]:
distances = {
    ('location1','S1'): 200,
    ('location1','S2'): 760,
    ('location1','S3'): 10,
    ('location2','S1'): 1000,
    ('location2','S2'): 340,
    ('location2','S3'): 70,
}

index = df.columns
df.columns = pd.MultiIndex.from_tuples(
    [(key[0], key[1], distances[key],) for key in index.get_values()],
    names=[index.get_level_values(0).name,
           index.get_level_values(1).name,
           'dist']
)
print(df)

Out[2]:
    loc  location1                     location2                    
    S           S1        S2        S3        S1        S2        S3
    dist      200       760       10        1000      340       70  
    a     0.503357 -0.461202 -1.412865  0.866237  1.290292  0.635869
    b    -0.904658 -1.190422 -0.198654 -0.916884 -1.070291 -1.918091
    c    -1.448068 -0.121475 -0.838693  0.047861 -0.131904  1.154370
    d     1.758752 -0.094962 -2.035204 -0.399195 -0.756726  1.609393
    e     0.421521  1.134518 -0.809148 -0.543523 -1.161328  1.261901



In[3]:
result = df.sortlevel(level=2, axis=1)
print(result)

Out[3]:
    loc  location1 location2 location1 location2 location1 location2
    S           S3        S3        S1        S2        S2        S1
    dist      10        70        200       340       760       1000
    a    -1.412865  0.635869  0.503357  1.290292 -0.461202  0.866237
    b    -0.198654 -1.918091 -0.904658 -1.070291 -1.190422 -0.916884
    c    -0.838693  1.154370 -1.448068 -0.131904 -0.121475  0.047861
    d    -2.035204  1.609393  1.758752 -0.756726 -0.094962 -0.399195
    e    -0.809148  1.261901  0.421521 -1.161328  1.134518 -0.543523

Upvotes: 2

Related Questions