Reputation: 11
I have a very large dataframe with a multiindex. I need to pass one column to C to do an operation quickly. For this operation, I need to know where the multiindex changes values. Since this is a large dataframe, I don't want to iterate over the rows or the index within Python. A small example:
import numpy as np
import pandas as pd

a = np.array([['bar', 'one', 0, 0],
              ['bar', 'two', 1, 2],
              ['bar', 'one', 2, 4],
              ['bar', 'two', 3, 6],
              ['foo', 'one', 4, 8],
              ['foo', 'two', 5, 10],
              ['bar', 'one', 6, 12],
              ['bar', 'two', 7, 14]], dtype=object)
df = pd.DataFrame(a, columns=['ix0', 'ix1', 'cd0', 'cd1'])
df.sort_values(['ix0', 'ix1'], inplace=True)
df.set_index(['ix0', 'ix1'], inplace=True)
The dataframe looks like this:
In [7]: df
Out[7]:
        cd0 cd1
ix0 ix1
bar one   0   0
    one   2   4
    one   6  12
    two   1   2
    two   3   6
    two   7  14
foo one   4   8
    two   5  10
Now I want an array or list that shows where the values in the multiindex change. I.e., the integer index where (bar, one) changes to (bar, two), (bar, two) changes to (foo, one), etc.
To be able to build the hierarchical output, it seems that this data must exist in the index. Is there a way to get to it?
The example output I'm looking for would be: [0, 3, 6, 7].
Thanks
Upvotes: 1
Reputation: 879661
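You could use np.unique with return_index=True, which returns the position of the first occurrence of each unique index value. Since the index is sorted, those positions are exactly where the index changes: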
In [69]: uniques, indices = np.unique(df.index, return_index=True)
In [70]: indices
Out[70]: array([0, 3, 6, 7])
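On the question of whether this data already exists in the index: a MultiIndex stores one integer code per level for each row, exposed as df.index.codes in recent pandas (older releases called it labels). The following is only a sketch, assuming the sorted dataframe from the question; the names codes, changed and change_points are mine. It compares the codes of consecutive rows with numpy instead of comparing tuples, which may scale better on a very large frame:

```python
import numpy as np
import pandas as pd

# Same toy data as in the question.
a = np.array([['bar', 'one', 0, 0],
              ['bar', 'two', 1, 2],
              ['bar', 'one', 2, 4],
              ['bar', 'two', 3, 6],
              ['foo', 'one', 4, 8],
              ['foo', 'two', 5, 10],
              ['bar', 'one', 6, 12],
              ['bar', 'two', 7, 14]], dtype=object)
df = pd.DataFrame(a, columns=['ix0', 'ix1', 'cd0', 'cd1'])
df.sort_values(['ix0', 'ix1'], inplace=True)
df.set_index(['ix0', 'ix1'], inplace=True)

# One integer code per level per row; shape (n_rows, n_levels).
codes = np.column_stack(list(df.index.codes))

# A row starts a new group if any level's code differs from the previous row.
changed = (codes[1:] != codes[:-1]).any(axis=1)

# Row 0 always starts a group; shift the others by one to get row positions.
change_points = np.concatenate(([0], np.flatnonzero(changed) + 1))
print(change_points)  # [0 3 6 7]
```

Unlike np.unique, this marks every position where the key differs from the previous row, so on an unsorted index it would report each run separately rather than the first occurrence of each key.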
Upvotes: 1