Reputation: 4002
I have a dataframe in this format:
a b x
1 1 31
1 2 1
1 3 42
1 4 423
1 5 42
1 6 3
1 7 44
1 8 65437
1 9 73
2 1 5656
2 2 7
2 3 5
2 4 5
2 5 34
a and b are index levels, x is the value.
I want to get rows 1 9 73 and 2 5 34, in other words, the last row for each value of a.
I've been messing with .loc, .iloc, and .xs for an hour, but I can't get it to work. How do I do this?
Upvotes: 7
Views: 2931
Reputation: 2998
The groupby solution presented by jezrael is the high-level, general solution. But when a groupby generates a lot of distinct groups (in the example presented by OP, that would be caused by a lot of distinct values for a), it performs quite poorly. Here I'm presenting an optimized solution for a special case (which matches OP's case).
Suppose you have a dataframe indexed by a MultiIndex with several levels, and the last of those levels has values that always start, within each group, at the same value; for instance, suppose that the values always start from 1 and count up. In the following example this would be the number level.
                value
name number
a    1       0.548126
b    1       0.774775
     2       0.483701
     3       0.820758
c    1       0.696832
     2       0.905071
d    1       0.750546
     2       0.761081
e    1       0.944682
     2       0.336210
Then, to get the cross section of the rows with the maximum/last number value for every unique value of name (or combination of values of whatever other levels you have), you can do:
df[np.roll(df.index.get_level_values('number') == 1, -1)]
and you get:
                value
name number
a    1       0.548126
b    3       0.820758
c    2       0.905071
d    2       0.761081
e    2       0.336210
Piece-by-piece:
- df.index.get_level_values('number'): gets an array of the values of the number level for each row.
- df.index.get_level_values('number') == 1: boolean array that is True for those rows in which number is 1.
- np.roll(df.index.get_level_values('number') == 1, -1): shifts all the values of the previous array backwards by one position in a circular manner (i.e. the first element becomes last, the second becomes first, and so on).
The idea is that the last row of a group always comes immediately before the first row of the next group, whose number is always 1. Therefore, if we get a boolean mask for the rows which have a number value of 1, we can just shift all of those booleans backwards by one, and we get a mask for the last values of number.
The special case of the last row is taken into account by shifting circularly, so that the first boolean ends up last. The first row always has number equal to 1, so that boolean is always True, and therefore the last row always gets selected (as expected).
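To make the shift concrete, here is a minimal sketch on a tiny hand-built frame. The data and the variable names is_first and is_last are made up for illustration; only the np.roll expression comes from the answer.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'number' restarts at 1 within every 'name' group.
index = pd.MultiIndex.from_tuples(
    [("a", 1), ("b", 1), ("b", 2), ("b", 3), ("c", 1), ("c", 2)],
    names=["name", "number"],
)
df = pd.DataFrame({"value": [10, 20, 21, 22, 30, 31]}, index=index)

# True on the first row of every group.
is_first = df.index.get_level_values("number") == 1
# Shift the mask one position towards the front, wrapping around:
# the row just before each group start (and the very last row) becomes True.
is_last = np.roll(is_first, -1)

print(is_first.tolist())  # [True, True, False, False, True, False]
print(is_last.tolist())   # [True, False, False, True, False, True]
print(df[is_last])        # rows (a, 1), (b, 3), (c, 2)
```

Note that the single-row group a is handled correctly: its only row is both first and last.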
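def innermost_level_max(df, start_value=1, drop_level=False):
    # The trick relies on the index being sorted, so that groups are contiguous.
    # MultiIndex.is_lexsorted() was deprecated and later removed from pandas;
    # is_monotonic_increasing checks lexicographic sortedness instead.
    assert df.index.is_monotonic_increasing
    level_values = df.index.get_level_values(-1)
    # Mask of group starts, rolled back by one row to mark group ends.
    result = df[np.roll(level_values == start_value, -1)]
    if drop_level:
        result = result.droplevel(-1)
    return result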
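A possible usage sketch for this helper, run on an abbreviated version of the data from the question. The function is restated so the block runs on its own; the sortedness check is written with is_monotonic_increasing, which is an assumption made here so that it also runs on pandas versions where is_lexsorted is no longer available.

```python
import numpy as np
import pandas as pd

def innermost_level_max(df, start_value=1, drop_level=False):
    # Assumption: a sorted index is what the original lexsort check was guarding.
    assert df.index.is_monotonic_increasing
    level_values = df.index.get_level_values(-1)
    result = df[np.roll(level_values == start_value, -1)]
    if drop_level:
        result = result.droplevel(-1)
    return result

# Abbreviated version of the question's frame: levels a and b, value x.
index = pd.MultiIndex.from_tuples(
    [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2)], names=["a", "b"]
)
df = pd.DataFrame({"x": [31, 1, 42, 5656, 7]}, index=index)

last_rows = innermost_level_max(df)                 # keeps both index levels
dropped = innermost_level_max(df, drop_level=True)  # index is just level a

print(last_rows)
print(dropped)
```

With drop_level=True the result is indexed by a alone, which is convenient when only the per-group value matters.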
import itertools as itt
import numpy as np
import pandas as pd
import perfplot

rng = np.random.default_rng(42)

def generate_names():
    # Yields 'a', 'b', ..., 'z', 'aa', 'ab', ... indefinitely.
    alphabet = [chr(i) for i in range(ord('a'), ord('z') + 1)]
    for length in itt.count(1):
        for tup in itt.product(*([alphabet]*length)):
            yield ''.join(tup)

def make_ragged_df(n):
    # Random group sizes between 1 and 3 (inclusive); only the first n rows are used.
    lengths = rng.integers(1, 3, endpoint=True, size=n)
    names = np.fromiter(
        itt.chain.from_iterable(
            itt.repeat(name, times=reps) for name, reps in zip(generate_names(), lengths)
        ),
        dtype='U100',
        count=n
    )
    # Within each group, number counts up from 1.
    numbers = np.fromiter(itt.chain.from_iterable(map(range, lengths)), int, count=n) + 1
    index = pd.MultiIndex.from_arrays([names, numbers], names=['name', 'number'])
    data = np.random.rand(n)
    df = pd.DataFrame({'value': data}, index=index)
    return df
This allows you to create a sample dataframe:
>>> make_ragged_df(10)
                value
name number
a    1       0.548126
b    1       0.774775
     2       0.483701
     3       0.820758
c    1       0.696832
     2       0.905071
d    1       0.750546
     2       0.761081
e    1       0.944682
     2       0.336210
Using perfplot:
import perfplot
benchmarks = perfplot.bench(
    setup=lambda n: make_ragged_df(n),
    kernels=[
        lambda df: df.groupby('name', sort=False).tail(1),
        lambda df: df[np.roll(df.index.get_level_values('number') == 1, -1)],
    ],
    labels=["with groupby", "with np.roll on == 1"],
    n_range=range(50, 10000, 500),
    xlabel="total number of rows",
)
benchmarks.show()
If you know what the last value of number always is, e.g. 3, you don't need anything more than an index slice:
df.loc[pd.IndexSlice[:, 3], :]
or a cross-section:
df.xs(3, level='number')
But probably if this is the case you wouldn't be reading this question to begin with.
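The two forms differ in what they do to the index, which a small sketch can show. The frame below is made up for illustration and assumes every group ends at number 3.

```python
import pandas as pd

# Hypothetical frame in which every name has numbers 1, 2, 3.
index = pd.MultiIndex.from_product(
    [["a", "b"], [1, 2, 3]], names=["name", "number"]
)
df = pd.DataFrame({"value": range(6)}, index=index)

# Index slice: both index levels are kept in the result.
sliced = df.loc[pd.IndexSlice[:, 3], :]
# Cross-section: the selected level is dropped by default.
cross = df.xs(3, level="number")
# Passing drop_level=False makes xs keep the level as well.
cross_kept = df.xs(3, level="number", drop_level=False)

print(sliced)
print(cross)
```

The index slice needs a sorted MultiIndex, which from_product provides here.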
Upvotes: 2
Reputation: 862741
You can use groupby with last:
print (df.groupby('a', as_index=False).last())
a b x
0 1 9 73
1 2 5 34
If a and b are levels of a MultiIndex, first call reset_index:
print (df.reset_index().groupby('a', as_index=False).last())
a b x
0 1 9 73
1 2 5 34
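One thing worth checking on your own data: last() takes the last non-null entry per column, so it is not always the same as the literal last row. A rough sketch, on made-up data containing a NaN, comparing it with groupby(...).tail(1), which can also group on the index level directly:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the last row of group a == 1 has a missing x.
index = pd.MultiIndex.from_tuples(
    [(1, 1), (1, 2), (2, 1), (2, 2)], names=["a", "b"]
)
df = pd.DataFrame({"x": [31.0, np.nan, 5.0, 34.0]}, index=index)

# last(): last non-null x per group; the result is indexed by a only.
by_last = df.groupby(level="a").last()
# tail(1): the physically last row per group, with the original MultiIndex kept.
by_tail = df.groupby(level="a").tail(1)

print(by_last)
print(by_tail)
```

When there are no missing values, both select the same rows, so the choice mostly depends on which index shape you want back.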
Upvotes: 6
Reputation: 221574
With df as the dataframe and the column a already being sorted, here's an approach -
df[np.append(np.diff(df['a'])>0,True)]
The basic idea is that we take successive differences along the sorted column a and look for a positive change with (>0), giving us a boolean array. The True elements in the boolean array signal the end of a "group" in that column. Since there is no change after the last element of the last group, we need to append a True element to the end of that boolean array. Finally, index df with that boolean array to select the matching rows and get the desired output.
Another approach uses np.unique with its optional argument return_index, which gives us the index of each group's first occurring element. To make it work for the last element, flip the column a, use np.unique to get the first occurring indices, and then subtract them (and one more) from the total number of rows in df. Finally, index into df with those positions for the final output. Thus, the implementation would be -
df.iloc[df.shape[0] - np.unique(df['a'][::-1],return_index=True)[1] - 1]
Sample run -
>>> df
a b x
0 1 26 46
1 1 17 32
2 1 12 65
3 1 31 96
4 1 34 10
5 1 7 80
6 1 64 50
7 1 0 34
8 1 93 28
9 2 18 92
10 2 59 22
11 2 87 31
>>> df[np.append(np.diff(df['a'])>0,True)]
a b x
8 1 93 28
11 2 87 31
>>> df.iloc[df.shape[0] - np.unique(df['a'][::-1],return_index=True)[1] - 1]
a b x
8 1 93 28
11 2 87 31
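As a quick sanity check, both expressions can be compared against groupby(...).tail(1) on a frame shaped like the one in the question. This is only a sketch: the data is an abbreviated copy of the question's values with a and b as ordinary columns, and the variable names are made up.

```python
import numpy as np
import pandas as pd

# Abbreviated question data, with a already sorted and a default integer index.
df = pd.DataFrame({
    "a": [1, 1, 1, 2, 2],
    "b": [1, 2, 3, 1, 2],
    "x": [31, 1, 42, 5656, 7],
})

# Group ends are where a increases, plus the final row.
via_diff = df[np.append(np.diff(df["a"]) > 0, True)]
# First occurrences in the reversed column, mapped back to forward positions.
via_unique = df.iloc[df.shape[0] - np.unique(df["a"][::-1], return_index=True)[1] - 1]
# Reference result from the groupby-based approach.
via_tail = df.groupby("a", sort=False).tail(1)

print(via_diff)
print(via_diff.equals(via_unique), via_diff.equals(via_tail))
```

The np.diff variant subtracts neighbouring values, so it assumes a numeric key column.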
Upvotes: 4