user1871528

Reputation: 1775

Pandas: What is the fastest way to search a large dataframe

pandas newbie question:

I have a dataframe with millions of rows, a sample output would be:

c_id  c1   c2
0     10  100
0     15  110
0     15  112
2     96  120
56    43  42
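
For reproducibility, the sample above can be built like this (the real frame has millions of rows):

import pandas as pd

# reconstruct the sample frame from the question
df = pd.DataFrame({
    "c_id": [0, 0, 0, 2, 56],
    "c1":   [10, 15, 15, 96, 43],
    "c2":   [100, 110, 112, 120, 42],
})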

For each customer id (c_id), I want to create a table and do some stuff with it. What's the best way to do it? I sorted the dataframe by c_id, then set the index to it:

df = df.sort_values('c_id', ascending=False)
df = df.set_index('c_id')

but a simple operation like:

temp_df = df.loc[:0]

takes forever. What's the fastest way to approach this problem? I thought a sorted set_index would do the trick, but apparently not.

EDIT1:

I want to get the list of all the unique values of c1 for each value of c_id. So something like:

df.loc[:0].c1.unique()

Upvotes: 0

Views: 3239

Answers (1)

FLab

Reputation: 7476

Don't explicitly create groups, but use pandas groupby.

For example, say that you want to find the average value per client; you can do:

df.groupby('c_id').mean()

and so on.

You can also apply (almost) arbitrary transformations using the .apply and .transform methods (although built-in methods like mean, std, min and max are much more efficient, as they are optimised).
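
As a rough sketch built from the sample data in the question (not benchmarked, just to show the three spellings side by side): the built-in aggregation, .apply and .transform compute the same thing, but the built-in path is the optimised one.

import pandas as pd

df = pd.DataFrame({
    "c_id": [0, 0, 0, 2, 56],
    "c1":   [10, 15, 15, 96, 43],
    "c2":   [100, 110, 112, 120, 42],
})

# built-in aggregation: uses the optimised groupby code path
means_builtin = df.groupby("c_id")["c1"].mean()

# same result via .apply with an arbitrary Python function (slower)
means_apply = df.groupby("c_id")["c1"].apply(lambda s: s.mean())

# .transform returns the result aligned to the original rows instead
df["c1_group_mean"] = df.groupby("c_id")["c1"].transform("mean")

print(means_builtin.equals(means_apply))  # expected: True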

To answer your specific question, you can do:

df.groupby('c_id').c1.nunique()

which gives:

c_id
0     2
2     1
56    1
Name: c1, dtype: int64

Note that some questions (this and this) suggest that .nunique is not the fastest way to go, and that this alternative is preferable:

df.groupby('c_id').c1.apply(lambda x: len(x.unique()))

(I haven't done any benchmarking myself...)
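
If you want to check this on your own data, here is a minimal timing sketch (using the standard library's timeit; the frame is just the question's sample repeated to get some volume, so the numbers are only illustrative):

import timeit
import pandas as pd

base = pd.DataFrame({
    "c_id": [0, 0, 0, 2, 56],
    "c1":   [10, 15, 15, 96, 43],
    "c2":   [100, 110, 112, 120, 42],
})
# repeat the sample to get a few hundred thousand rows
df = pd.concat([base] * 100_000, ignore_index=True)

t_nunique = timeit.timeit(lambda: df.groupby("c_id").c1.nunique(), number=10)
t_apply = timeit.timeit(
    lambda: df.groupby("c_id").c1.apply(lambda x: len(x.unique())), number=10
)
print(f"nunique: {t_nunique:.3f}s  apply/unique: {t_apply:.3f}s")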

Upvotes: 1
