Xukrao
Xukrao

Reputation: 8634

Pandas Index: identify subranges where the same value is consecutively repeated

Problem Description

I'm looking for an efficient way to identify all the subranges in a pandas Index object where the same value is consecutively repeated.

Example Problem

As a simple example, consider the following pandas Index object:

import pandas as pd
idx = pd.Index(['X', 'C', 'C', 'C', 'Q', 'Q', 'Q', 'Q', 'A', 'P', 'P'])

In this example the value C is repeated from location 1 through 3, the value Q is repeated from location 4 through 7 and the value P is repeated from location 9 through 10. The result that I'm then trying to get is a list of tuples (or something similar) like this:

[(1, 3, 'C'), (4, 7, 'Q'), (9, 10, 'P')]

Tried thus far

I've been experimenting with the pandas.Index.duplicated property, but with this alone I haven't been able to succeed yet in getting the desired result.

Edit:

Many thanks everyone for the great answers. I have one follow-up question. Suppose that the Index also contains non-contiguous duplicate values like in this example (where value X appears more than once):

idx = pd.Index(['X', 'C', 'C', 'C', 'Q', 'Q', 'Q', 'Q', 'X', 'P', 'P'])

How could you obtain a result that ignores the X values? I.e. how to get the following result for this example:

[(1, 3, 'C'), (4, 7, 'Q'), (9, 10, 'P')]

Upvotes: 2

Views: 104

Answers (2)

Brad Solomon
Brad Solomon

Reputation: 40908

Original question

Where idx = pd.Index(['X', 'C', 'C', 'C', 'Q', 'Q', 'Q', 'Q', 'A', 'P', 'P']).

A bit unconventional but should work, and also seems to be significantly faster:

# Get a new Index which is the unique duplicated values in `idx`
un = idx[idx.duplicated(keep=False)].unique()

# Call `get_loc` on `idx` for each member of `un` above  
# `np.where` gets position of True in boolean Index
res = []
for i in un:
    w = np.where(idx.get_loc(i))[0]
    # w[0], w[-1] analogous to v.min(), v.max() from @MaxU's answer
    res.append((w[0], w[-1], i))

print(res)
# [(1, 3, 'C'), (4, 7, 'Q'), (9, 10, 'P')]

Timing:

%timeit myanswer()
105 µs ± 3.19 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit maxu()
1.21 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Uncommented:

un = idx[idx.duplicated(keep=False)].unique()
res = []
for i in un:
    w = np.where(idx.get_loc(i))[0]
    res.append((w[0], w[-1], i))

Edited question

Where idx = pd.Index(['X', 'C', 'C', 'C', 'Q', 'Q', 'Q', 'Q', 'X', 'P', 'P']).

To get to un here, first get a boolean index which is True when a value is equal the the value that comes before or after it, and False otherwise. This is analogous to idx.duplicated(keep=False) in the first part.

b = (Series(idx).shift() == idx) | (Series(idx).shift(-1) == idx)
un = idx[b].unique()
# Rest should be the same

Upvotes: 5

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210882

Here is one way:

In [107]: ix = pd.Series(idx.values)

In [108]: [(v.min(), v.max(),k) for k,v in ix.groupby(ix).groups.items() if len(v) > 1]
Out[108]: [(1, 3, 'C'), (9, 10, 'P'), (4, 7, 'Q')]

Upvotes: 5

Related Questions