Reputation: 2847

Return a list of lists from elements grouped by another column

I don't know how to describe my question, so I will just show an example here.

The data frame has two column A and B. I want to make it return a list like this.

[[3,4],[5,8,6],[6,8],[10,1]]

As you see, this is grouped by A and returns a list of numbers in B column. To be noticed, the order of elements in B DOESN'T CHANGE. Also the sub-lists have the same order as shown in column A. ([3,4] from group 1, [5,8,6] from group 2, etc)

Suppose dataframe is ordered by A already. I know how to use for loop to do it, but my dataset has 1 billion records. So I am looking for some efficient and clean code for this problem.

Upvotes: 3

Answers (4)

MrFun

Reputation: 2588

For large datasets I recommend using Numpy as it is faster.

I also recommend not using a for-loop to sort, df.sort_values() is way faster.

Here is a comparison of a 15 Million row dataset I worked on.

Via `Numpy`

import numpy as np
df = df[['a','b']]
    
keys, values = df.sort_values('a').values.T
ukeys, index = np.unique(keys, True)
arrays = np.split(values, index[1:])
df = pd.DataFrame({'a':ukeys,'b':[list(a) for a in arrays]})

NUMPY
Total time: 102.379 s for 15,397,309 rows
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     3         1    1205208.0 1205208.0      1.2    
     4         1   60671365.0 60671365.0     59.3   
     5         1   16897187.0 16897187.0     16.5    
     6         1    1430774.0 1430774.0      1.4     
     7         1   22174794.0 22174794.0     21.7     
     8         1          4.0      4.0      0.0

With Pandas 'Groupby'

df.groupby('a')['b'].apply(list)

PANDAS GROUPBY
Total time: 146.23 s for 15,397,309 rows

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     3         1    1181714.0 1181714.0      0.8     
     4         1  145048477.0 145048477.0     99.2      
     5         1          3.0      3.0      0.0

Upvotes: 1

Aaditya Ura

Reputation: 12669

There are different approaches :

data is:

with open('textrr','r') as f:
    data=[line.split() for line in f.readlines()]

first approach using collections.defaultdict()

d=collections.defaultdict(list)

for item in data:
    d[item[0]].append(item[1])

print([i for i in d.values() if i[0].isdigit()])

output:

order will be not same :

[['10', '1'], ['6', '8'], ['3', '4'], ['5', '8', '6']]

using itertools.grouby:

import itertools
print([[sub[1] for sub in i] for j,i in itertools.groupby(data,key=lambda x:x[0]) if list(j)[0].isdigit()])

output:

order will be same

[['3', '4'], ['5', '8', '6'], ['6', '8'], ['10', '1']]

At last if you don't want to use any import then you can try manual approach:

groupby={}

for item in data:
    if item[0].isdigit() and item[0] not in groupby:
        groupby[item[0]]=[item[1]]
    elif item[0].isdigit():
        groupby[item[0]].append(item[1])

print(groupby.values())

output:

[['10', '1'], ['3', '4'], ['6', '8'], ['5', '8', '6']]

Upvotes: 0

Alexander

Reputation: 109546

You first need to group on the first column A, then get the unique values in B (assuming you only want unique values and not duplicates). After you've done that, use a lambda expression to convert each of the np.array values to lists, and then convert the resulting series to a list using .tolist().

>>> df.groupby('A', sort=False)['B'].apply(list).tolist()

Or,

>>> [list(v) for v in df.groupby('A', sort=False)['B'].unique()]

Or,

>>> df.groupby('A', sort=False)['B'].apply(lambda x: x.unique().tolist()).tolist()

[[3, 4], [5, 8, 6], [6, 8], [10, 1]]

I would also recommend not sorting the groupby operation.

Here are some timing comparisons for anyone who is interested:

df_ = pd.concat([df] * 10000)  # Set-up larger dataframe with 90k rows.

%timeit df_.groupby('A', sort=False)['B'].unique().apply(list).tolist()
# 100 loops, best of 3: 5.9 ms per loop

%timeit df_.groupby('A', sort=False)['B'].apply(list).tolist()
# 100 loops, best of 3: 6.79 ms per loop

%timeit list(map(list, df_.groupby('A', sort=False)['B'].apply(list)))
# 100 loops, best of 3: 8.02 ms per loop

Upvotes: 2

jpp

Reputation: 164663

An alternative to @Alexander's solution is to apply list to each element of a groupby.apply(list) object.

In general, I would prefer this solution to a lambda based solution, which is just a loop.

res = list(map(list, df.groupby('A', sort=False)['B'].apply(list)))

Result:

[[3, 4], [5, 8, 6], [6, 8], [10, 1]]

Upvotes: 1

Return a list of lists from elements grouped by another column

Answers (4)

For large datasets I recommend using Numpy as it is faster.

Via Numpy

With Pandas 'Groupby'

Related Questions

Via `Numpy`