Reputation: 1775
pandas newbie question:
I have a dataframe with millions of rows; a sample looks like:
c_id  c1   c2
0     10  100
0     15  110
0     15  112
2     96  120
56    43   42
For each c_id, I want to create a table and do some stuff to it. What's the best way to do that? I sorted the dataframe by c_id, then set the index to it:
df = df.sort_values('c_id', ascending=False)
df = df.set_index('c_id')
but a simple operation like:
temp_df = df.loc[:0]
takes forever. What's the fastest way to approach this problem? I thought a sorted index would do the trick, but apparently not.
EDIT1:
I want to get the list of all the unique values of c1 for each value of c_id, so something like:
df.loc[:0].c1.unique()
Upvotes: 0
Views: 3239
Reputation: 7476
Don't explicitly create the groups; use pandas' groupby instead.
For example, say you want to find the average value per client; you can do:
df.groupby('c_id').mean()
and so on.
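To make that concrete, here is a minimal runnable sketch using the sample data from the question (column values taken from the table above; the real frame would have millions of rows):

import pandas as pd

# Sample frame from the question; the real one has millions of rows
df = pd.DataFrame({'c_id': [0, 0, 0, 2, 56],
                   'c1':   [10, 15, 15, 96, 43],
                   'c2':   [100, 110, 112, 120, 42]})

# Per-client averages of every numeric column
print(df.groupby('c_id').mean())
#              c1          c2
# c_id
# 0     13.333333  107.333333
# 2     96.000000  120.000000
# 56    43.000000   42.000000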
You can also apply (almost) arbitrary transformations using the .apply and .transform methods (although built-in methods like mean, std, min, and max are much more efficient, as they are optimised).
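A quick sketch of both, reusing the df built above (the per-group spread and the c2_demeaned column are just illustrative examples, not something from the question):

# .apply runs an arbitrary function once per group and collects the results:
# here, the spread (max - min) of c1 for each client
print(df.groupby('c_id')['c1'].apply(lambda g: g.max() - g.min()))

# .transform returns a result aligned row-for-row with the original frame:
# here, each row's c2 minus its client's mean c2
df['c2_demeaned'] = df.groupby('c_id')['c2'].transform(lambda g: g - g.mean())
print(df)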
To answer your specific question, you can do:
df.groupby('c_id').c1.nunique()
which gives:
c_id
0      2
2      1
56     1
Name: c1, dtype: int64
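And if you want the actual values rather than their count, the grouped column also has a .unique() method that returns one array per group:

print(df.groupby('c_id').c1.unique())
# c_id
# 0     [10, 15]
# 2         [96]
# 56        [43]
# Name: c1, dtype: object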
Note that some questions (this and this) suggest that .nunique is not the fastest way to go, and that this is the alternative:
df.groupby('c_id').c1.apply(lambda x: len(x.unique()))
(I haven't done any benchmarking myself...)
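If performance matters for your data, a minimal timing sketch along these lines should settle it (plain timeit here; %timeit in IPython is handier):

import timeit

# df: your real frame (the 5-row sample above is too small to be meaningful)
t_nunique = timeit.timeit(lambda: df.groupby('c_id').c1.nunique(), number=100)
t_apply = timeit.timeit(
    lambda: df.groupby('c_id').c1.apply(lambda x: len(x.unique())),
    number=100)
print(t_nunique, t_apply)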
Upvotes: 1