Neil
Neil

Reputation: 8247

converting pandas dataframe into series

I have following dataframe.

   order_id   Clusters
0    519     Cluster 5
1    520     Cluster 1
2    521     Cluster 1
3    523     Cluster 5
4    524     Cluster 1
5    525     Cluster 4
6    526     Cluster 4
7    527     Cluster 1
8    528     Cluster 2
9    529     Cluster 5
10   530     Cluster 6
11   531     Cluster 3
12   532     Cluster 1
13   533     Cluster 4
14   534     Cluster 5
15   535     Cluster 5

I want following series out of above dataframe.

Cluster 1   [520 ,521, 524, 527, 532]
Cluster 2   [528]
Cluster 3   [531]
Cluster 4   [525,526,533]
Cluster 5   [519,523,529,534,535]
Cluster 6   [530]

This is my approach in python.

clusters_order_id = []

df_clusters = df.groupby('Clusters')

for i in df_clusters['order_id']:
   clusters_order_id.append(i)

which gives me

clusters_order_id
Out[196]: 
0    (Cluster 1, [520, 521, 524, 527, 532])
1                        (Cluster 2, [528])
2                        (Cluster 3, [531])
3              (Cluster 4, [525, 526, 533])
4    (Cluster 5, [519, 523, 529, 534, 535])
5                        (Cluster 6, [530])

But I am not getting how to seperate above into series of above form. So that Cluster 1,Cluster 2 will become my index and corresponding order ids will be an array. Please help.

Upvotes: 3

Views: 1267

Answers (2)

jezrael
jezrael

Reputation: 862406

You can use groupby and tolist:

print df.groupby('Clusters')['order_id'].apply(lambda x: x.tolist())

Clusters
Cluster 1    [520, 521, 524, 527, 532]
Cluster 2                        [528]
Cluster 3                        [531]
Cluster 4              [525, 526, 533]
Cluster 5    [519, 523, 529, 534, 535]
Cluster 6                        [530]
Name: order_id, dtype: object

Timing:

In [153]: %timeit df.groupby('Clusters')['order_id'].apply(lambda x: x.tolist())
1000 loops, best of 3: 751 µs per loop

In [154]: %timeit df.pivot_table(index='Clusters', aggfunc=pd.Series.tolist)
100 loops, best of 3: 3.55 ms per loop

Upvotes: 1

Anton Protopopov
Anton Protopopov

Reputation: 31662

Another solution with pivot_table:

In [473]: df.pivot_table(index='Clusters', aggfunc=pd.Series.tolist)
Out[473]:
                            order_id
Clusters
Cluster 1  [520, 521, 524, 527, 532]
Cluster 2                      [528]
Cluster 3                      [531]
Cluster 4            [525, 526, 533]
Cluster 5  [519, 523, 529, 534, 535]
Cluster 6                      [530]

Upvotes: 2

Related Questions