Reputation: 2847
I don't know how to describe my question, so I will just show an example here.
A B
1 3
1 4
2 5
2 8
2 6
3 6
3 8
4 10
4 1
The data frame has two column A and B. I want to make it return a list like this.
[[3,4],[5,8,6],[6,8],[10,1]]
As you see, this is grouped by A
and returns a list of numbers in B
column. To be noticed, the order of elements in B
DOESN'T CHANGE. Also the sub-lists have the same order as shown in column A. ([3,4]
from group 1, [5,8,6]
from group 2, etc)
Suppose dataframe is ordered by A already. I know how to use for loop to do it, but my dataset has 1 billion records. So I am looking for some efficient and clean code for this problem.
Upvotes: 3
Views: 1512
Reputation: 2588
I also recommend not using a for-loop to sort, df.sort_values()
is way faster.
Here is a comparison of a 15 Million row dataset I worked on.
Numpy
import numpy as np
df = df[['a','b']]
keys, values = df.sort_values('a').values.T
ukeys, index = np.unique(keys, True)
arrays = np.split(values, index[1:])
df = pd.DataFrame({'a':ukeys,'b':[list(a) for a in arrays]})
NUMPY
Total time: 102.379 s for 15,397,309 rows
Line # Hits Time Per Hit % Time Line Contents
==============================================================
3 1 1205208.0 1205208.0 1.2
4 1 60671365.0 60671365.0 59.3
5 1 16897187.0 16897187.0 16.5
6 1 1430774.0 1430774.0 1.4
7 1 22174794.0 22174794.0 21.7
8 1 4.0 4.0 0.0
df.groupby('a')['b'].apply(list)
PANDAS GROUPBY
Total time: 146.23 s for 15,397,309 rows
Line # Hits Time Per Hit % Time Line Contents
==============================================================
3 1 1181714.0 1181714.0 0.8
4 1 145048477.0 145048477.0 99.2
5 1 3.0 3.0 0.0
Upvotes: 1
Reputation: 12669
There are different approaches :
data is:
with open('textrr','r') as f:
data=[line.split() for line in f.readlines()]
first approach using collections.defaultdict()
d=collections.defaultdict(list)
for item in data:
d[item[0]].append(item[1])
print([i for i in d.values() if i[0].isdigit()])
output:
order will be not same :
[['10', '1'], ['6', '8'], ['3', '4'], ['5', '8', '6']]
using itertools.grouby:
import itertools
print([[sub[1] for sub in i] for j,i in itertools.groupby(data,key=lambda x:x[0]) if list(j)[0].isdigit()])
output:
order will be same
[['3', '4'], ['5', '8', '6'], ['6', '8'], ['10', '1']]
At last if you don't want to use any import then you can try manual approach:
groupby={}
for item in data:
if item[0].isdigit() and item[0] not in groupby:
groupby[item[0]]=[item[1]]
elif item[0].isdigit():
groupby[item[0]].append(item[1])
print(groupby.values())
output:
[['10', '1'], ['3', '4'], ['6', '8'], ['5', '8', '6']]
Upvotes: 0
Reputation: 109546
You first need to group on the first column A
, then get the unique values in B
(assuming you only want unique values and not duplicates). After you've done that, use a lambda expression to convert each of the np.array values to lists, and then convert the resulting series to a list using .tolist()
.
>>> df.groupby('A', sort=False)['B'].apply(list).tolist()
Or,
>>> [list(v) for v in df.groupby('A', sort=False)['B'].unique()]
Or,
>>> df.groupby('A', sort=False)['B'].apply(lambda x: x.unique().tolist()).tolist()
[[3, 4], [5, 8, 6], [6, 8], [10, 1]]
I would also recommend not sorting the groupby operation.
Here are some timing comparisons for anyone who is interested:
df_ = pd.concat([df] * 10000) # Set-up larger dataframe with 90k rows.
%timeit df_.groupby('A', sort=False)['B'].unique().apply(list).tolist()
# 100 loops, best of 3: 5.9 ms per loop
%timeit df_.groupby('A', sort=False)['B'].apply(list).tolist()
# 100 loops, best of 3: 6.79 ms per loop
%timeit list(map(list, df_.groupby('A', sort=False)['B'].apply(list)))
# 100 loops, best of 3: 8.02 ms per loop
Upvotes: 2
Reputation: 164663
An alternative to @Alexander's solution is to apply list
to each element of a groupby.apply(list)
object.
In general, I would prefer this solution to a lambda
based solution, which is just a loop.
res = list(map(list, df.groupby('A', sort=False)['B'].apply(list)))
Result:
[[3, 4], [5, 8, 6], [6, 8], [10, 1]]
Upvotes: 1