권효섭

Reputation: 69

Dataframe hierarchical indexing speedup

I have a DataFrame like this:

+----+------------+------------+------------+
|    |            |    type    | payment    | 
+----+------------+------------+------------+
| id | res_number |            |            | 
+----+------------+------------+------------+
|  a |     1      |    toys    | 20000      |
|    |     2      |  clothing  | 30000      |
|    |     3      |    food    | 40000      |
|  b |     4      |    food    | 40000      |
|    |     5      |   laptop   | 30000      |
+----+------------+------------+------------+

As you can see, id and res_number form a hierarchical (MultiIndex) row index, while type and payment are ordinary columns. What I want to get is below:

array([['toys', 20000],
       ['clothing', 30000],
       ['food', 40000]])

It is indexed by id (= 'a') regardless of the res_number values, and I know that

df.loc[['a']].values

works perfectly for it. But this kind of indexing is too slow... I have to index 150,000 values.

So I indexed the DataFrame with

df.iloc[0].values

but it only brought back a single row:

array(['toys', 20000])

Is there any faster method for indexing a hierarchical structure?
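For reference, the example DataFrame above can be reconstructed with a MultiIndex like this (a sketch; the column order is assumed from the table):

```python
import pandas as pd

# Rebuild the example table with a two-level (id, res_number) row index
df = pd.DataFrame(
    {"type": ["toys", "clothing", "food", "food", "laptop"],
     "payment": [20000, 30000, 40000, 40000, 30000]},
    index=pd.MultiIndex.from_tuples(
        [("a", 1), ("a", 2), ("a", 3), ("b", 4), ("b", 5)],
        names=["id", "res_number"]),
)

# Label-based lookup of every row under id 'a'
print(df.loc[["a"]].values)
```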

Upvotes: 3

Views: 223

Answers (3)

Tai

Reputation: 7994

Another option: keep an extra dictionary of the beginning and ending positions of each group. (This assumes the index is sorted.)

Option 1: use the first and last position in a group to query with iloc.

d = {k: slice(v[0], v[-1]+1) for k, v in df.groupby("id").indices.items()}
df.iloc[d["b"]].values

array([['food', 40000],
       ['laptop', 30000]], dtype=object)

Option 2: use the precomputed slice with NumPy indexing on df.values.

df.values[d["a"]] 

Timing

df_testing = pd.DataFrame({"id": [str(v) for v in np.random.randint(0, 100, 150000)],
                        "res_number": np.arange(150000),
                        "payment": [v for v in np.random.randint(0, 100000, 150000)]}
             ).set_index(["id","res_number"]).sort_index()
d = {k: slice(v[0], v[-1]+1) for k, v in df_testing.groupby("id").indices.items()}
# by COLDSPEED
%timeit df_testing.xs('5').values
303 µs ± 17.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# by OP
%timeit df_testing.loc['5'].values
358 µs ± 22.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# Tai 1
%timeit df_testing.iloc[d["5"]].values 
130 µs ± 3.04 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# Tai 2
%timeit df_testing.values[d["5"]] 
7.26 µs ± 845 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

However, building d is not free.

%timeit {k: slice(v[0], v[-1]+1) for k, v in df_testing.groupby("id").indices.items()}
16.3 ms ± 6.89 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

Is creating the extra lookup table d worth it?

The cost of building the lookup table is amortized over the queries. On my toy dataset, it takes about 16.3 ms / (300 µs - 7 µs) ≈ 56 queries to recover the up-front cost.

Again, the index needs to be sorted.
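Putting the two pieces together, the precomputation and the fast lookup might be packaged like this (a sketch; `build_group_slices` is my own helper name, not a pandas API):

```python
import pandas as pd

def build_group_slices(df, level="id"):
    """Map each value of the given index level to a positional slice.
    Only valid when the index is sorted on that level."""
    return {k: slice(v[0], v[-1] + 1)
            for k, v in df.groupby(level=level).indices.items()}

# Toy data shaped like the question's example
df = pd.DataFrame(
    {"type": ["toys", "clothing", "food", "food", "laptop"],
     "payment": [20000, 30000, 40000, 40000, 30000]},
    index=pd.MultiIndex.from_tuples(
        [("a", 1), ("a", 2), ("a", 3), ("b", 4), ("b", 5)],
        names=["id", "res_number"])).sort_index()

slices = build_group_slices(df)
values = df.values          # pull the ndarray out once...
print(values[slices["a"]])  # ...then every lookup is a cheap NumPy slice
```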

Upvotes: 0

cs95

Reputation: 402393

Option 1
pd.DataFrame.xs

df.xs('a').values

Option 2
pd.DataFrame.loc

df.loc['a'].values

Option 3
pd.DataFrame.query

df.query("ilevel_0 == 'a'").values

Option 4
A bit more roundabout, use pd.MultiIndex.get_level_values to create a mask:

df[df.index.get_level_values(0) == 'a'].values

array([['toys', 20000],
       ['clothing', 30000],
       ['food', 40000]], dtype=object)
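On the example data, all four options produce the same array; a quick sanity check (a sketch that rebuilds the example with unnamed index levels, since `ilevel_0` is the name pandas `query` gives an unnamed index level 0; with named levels you would use the level name instead):

```python
import pandas as pd

# Example data with an unnamed two-level index
df = pd.DataFrame(
    {"type": ["toys", "clothing", "food", "food", "laptop"],
     "payment": [20000, 30000, 40000, 40000, 30000]},
    index=pd.MultiIndex.from_tuples(
        [("a", 1), ("a", 2), ("a", 3), ("b", 4), ("b", 5)]))

expected = df.loc["a"].values                                       # Option 2
assert (df.xs("a").values == expected).all()                        # Option 1
assert (df.query("ilevel_0 == 'a'").values == expected).all()       # Option 3
assert (df[df.index.get_level_values(0) == "a"].values == expected).all()  # Option 4
print(expected)
```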

Upvotes: 4

Scott Boston

Reputation: 153460

Option 5

Use .loc with the axis parameter:

df.loc(axis=0)['a', :].values

Output:

array([['toys', 20000],
       ['clothing', 30000],
       ['food', 40000]], dtype=object)

Upvotes: 1
