parchment

Reputation: 4002

Getting the last element of a level in a multiindex

I have a dataframe in this format:

a   b   x
1   1   31
1   2   1
1   3   42
1   4   423
1   5   42
1   6   3
1   7   44
1   8   65437
1   9   73
2   1   5656
2   2   7
2   3   5
2   4   5
2   5   34

a and b are indexes, x is the value.

I want to get rows 1 9 73 and 2 5 34; in other words, the last row of each group in the first level (a).

I've been messing with .loc, .iloc, and .xs for an hour, but I can't get it to work. How do I do this?

Upvotes: 7

Views: 2931

Answers (3)

Anakhand

Reputation: 2998

Special case

The groupby solution presented by jezrael is the high-level, general solution. But when the groupby produces a lot of distinct groups (in the OP's example, that would mean a lot of distinct values for a), it performs quite poorly. Here I'm presenting an optimized solution for a special case, which happens to match the OP's case.

Suppose you have a dataframe indexed by a MultiIndex with several levels, and the last of those levels has values that always start, within each group, at the same value; for instance suppose that the values always start from 1 and count up. In the following example this would be the number level.

                value
name number          
a    1       0.548126
b    1       0.774775
     2       0.483701
     3       0.820758
c    1       0.696832
     2       0.905071
d    1       0.750546
     2       0.761081
e    1       0.944682
     2       0.336210

Then, to get the cross section of the rows with maximum/last number value for every unique value of name (or combination of values of whatever other levels you have), you can do:

df[np.roll(df.index.get_level_values('number') == 1, -1)]

and you get:

                value
name number          
a    1       0.548126
b    3       0.820758
c    2       0.905071
d    2       0.761081
e    2       0.336210

Explanation

Piece-by-piece:

  • df.index.get_level_values('number'): gets an array of the values for the number level for each row
  • df.index.get_level_values('number') == 1: boolean array that is True for those rows in which number is 1
  • np.roll(df.index.get_level_values('number') == 1, -1): shifts all the values of the previous array backwards by one position in a circular manner (i.e. each element moves one position earlier, and the first element wraps around to the end).

The idea is that the last row of a group always comes immediately before the first row of the next group, and that first row always has number equal to 1. Therefore, if we get a boolean mask for the rows where number is 1 and shift all of those booleans backwards by one position, we get a mask for the last row of each group.

The special case of the very last row of the dataframe is handled by shifting circularly, so that the first boolean ends up last: the first row always has number equal to 1, so that boolean is always True, and therefore the last row always gets selected (as expected).
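
To make this concrete, here are the intermediate arrays for the example frame above (a quick sketch; the variable names are just for illustration):

numbers = df.index.get_level_values('number')
# values: [1, 1, 2, 3, 1, 2, 1, 2, 1, 2]

starts = (numbers == 1)
# [ True  True False False  True False  True False  True False]

lasts = np.roll(starts, -1)
# [ True False False  True False  True False  True False  True]

df[lasts]  # same cross section as shown above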

Generic function

import numpy as np


def innermost_level_max(df, start_value=1, drop_level=False):
    # the index must be sorted for the roll trick to work
    assert df.index.is_lexsorted()
    # mark rows where the innermost level restarts, then roll the mask
    # back by one so it marks the last row of each group instead
    level_values = df.index.get_level_values(-1)
    result = df[np.roll(level_values == start_value, -1)]
    if drop_level:
        result = result.droplevel(-1)
    return result
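
For example, on the frame above this reproduces the cross section shown earlier (a usage sketch):

>>> innermost_level_max(df)
                value
name number          
a    1       0.548126
b    3       0.820758
c    2       0.905071
d    2       0.761081
e    2       0.336210

Passing drop_level=True additionally removes the now-redundant number level from the result's index.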

Setup code to play around

import itertools as itt

import numpy as np
import pandas as pd
import perfplot

rng = np.random.default_rng(42)


def generate_names():
    alphabet = [chr(i) for i in range(ord('a'), ord('z') + 1)]
    for length in itt.count(1):
        for tup in itt.product(*([alphabet]*length)):
            yield ''.join(tup)


def make_ragged_df(n):
    # random group sizes of 1 to 3 rows per name
    lengths = rng.integers(1, 3, endpoint=True, size=n)
    names = np.fromiter(
        itt.chain.from_iterable(itt.repeat(name, times=r) for name, r in zip(generate_names(), lengths)),
        dtype='U100',
        count=n
    )
    # within each name, number counts up from 1
    numbers = np.fromiter(itt.chain.from_iterable(map(range, lengths)), int, count=n) + 1
    index = pd.MultiIndex.from_arrays([names, numbers], names=['name', 'number'])
    data = rng.random(n)  # use the seeded generator so values are reproducible too
    df = pd.DataFrame({'value': data}, index=index)
    return df

This allows you to create a sample dataframe:

>>> make_ragged_df(10)
                value
name number          
a    1       0.548126
b    1       0.774775
     2       0.483701
     3       0.820758
c    1       0.696832
     2       0.905071
d    1       0.750546
     2       0.761081
e    1       0.944682
     2       0.336210

Performance

Using perfplot:

import perfplot

benchmarks = perfplot.bench(
    setup=lambda n: make_ragged_df(n),
    kernels=[
        lambda df: df.groupby('name', sort=False).tail(1),
        lambda df: df[np.roll(df.index.get_level_values('number') == 1, -1)],
    ],
    labels=["with groupby", "with np.roll on == 1"],
    n_range=range(50, 10000, 500),
    xlabel="total number of rows",
)

benchmarks.show()

[performance plot: runtime vs. total number of rows for the groupby and np.roll approaches]


Even more special case

If you know what the last value of number always is (e.g. 3), you don't need anything more than an index slice:

df.loc[pd.IndexSlice[:, 3], :]

or a cross-section:

df.xs(3, level='number')

But if that is the case, you probably wouldn't be reading this question to begin with.

Upvotes: 2

jezrael

Reputation: 862741

You can use groupby with last:

print (df.groupby('a', as_index=False).last())
   a  b   x
0  1  9  73
1  2  5  34

If a and b are levels of MultiIndex, first call reset_index:

print (df.reset_index().groupby('a', as_index=False).last())
   a  b   x
0  1  9  73
1  2  5  34

Upvotes: 6

Divakar

Reputation: 221574

With df as the dataframe and the column a already being sorted, here's an approach -

df[np.append(np.diff(df['a'])>0,True)]

The basic idea is that we take differences along the sorted column a and look for positive change with (>0), giving us a boolean array. The True elements in that boolean array signal the end of a "group" in that column. Since the last row of the dataframe has no following row to compare against, we append a True element at the end of that boolean array so the last group's final row is also selected. Finally, we index df with this boolean array to select the desired rows and give us the output.
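
To see what each piece produces, consider the 12-row sample run shown below, where a has nine 1s followed by three 2s (a quick sketch of the intermediates):

np.diff(df['a'])
# eleven differences, all 0 except a 1 at position 8, where a steps from 1 to 2

np.append(np.diff(df['a']) > 0, True)
# boolean mask of length 12: True at positions 8 and 11, i.e. the last row
# of the a == 1 group and the (appended) last row of the whole frame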

Another approach uses np.unique with its optional argument return_index, which gives us the index of each group's first occurring element. To make it work for the last element instead, reverse the column a, use np.unique to get the first-occurrence indices in the reversed order, and then subtract those (plus one) from the total number of rows in df. Finally, index into df with those positions for the final output. Thus, the implementation would be -

df.iloc[df.shape[0] - np.unique(df['a'][::-1],return_index=True)[1] - 1]

Sample run -

>>> df
    a   b   x
0   1  26  46
1   1  17  32
2   1  12  65
3   1  31  96
4   1  34  10
5   1   7  80
6   1  64  50
7   1   0  34
8   1  93  28
9   2  18  92
10  2  59  22
11  2  87  31
>>> df[np.append(np.diff(df['a'])>0,True)]
    a   b   x
8   1  93  28
11  2  87  31
>>> df.iloc[df.shape[0] - np.unique(df['a'][::-1],return_index=True)[1] - 1]
    a   b   x
8   1  93  28
11  2  87  31
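
For reference, here is what the intermediate values of the np.unique version look like on that same sample (a quick sketch):

>>> np.unique(df['a'][::-1], return_index=True)[1]
array([3, 0])
>>> df.shape[0] - np.unique(df['a'][::-1], return_index=True)[1] - 1
array([ 8, 11])

The reversed column starts with the three a == 2 rows, so the first occurrence of 2 is at position 0 and the first occurrence of 1 is at position 3; converting those back to positions in the original order gives 11 and 8, the last row of each group.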

Upvotes: 4
