Reputation: 17
How to count unique row and its numbers of appearance in pandas
?
Lead ID bank_account_id NO.of account
0 308148.0 12460.0 1
1 310443.0 12654.0 1
2 310443.0 12655.0 1
3 312745.0 12835.0 1
4 312745.0 12836.0 1
5 312745.0 12837.0 1
6 312745.0 12838.0 1
7 312745.0 12839.0 1
8 313082.0 13233.0 1
9 314036.0 13226.0 1
10 314559.0 13271.0 1
11 314559.0 13273.0 1
12 316728.0 13228.0 1
13 316728.0 13230.0 1
14 316728.0 13232.0 1
15 316728.0 13234.0 1
16 316728.0 13235.0 1
17 316728.0 13272.0 1
18 318465.0 13419.0 1
19 318465.0 13420.0 1
20 318465.0 13421.0 1
21 318465.0 13422.0 1
22 318465.0 13423.0 1
23 318465.0 13424.0 1
24 318465.0 13425.0 1
25 321146.0 13970.0 1
26 321146.0 13971.0 1
27 321218.0 14779.0 1
28 321356.0 15142.0 1
29 321356.0 15144.0 1
30 321356.0 15146.0 1
In this dataset I want to get bank_account_id
corresponding to every unique Lead ID & total number of bank_account_id
every Lead ID
is having.
Upvotes: 0
Views: 91
Reputation: 21
Hy, try user a single df.value_counts()
you will receive a good aggregation table.
Lead ID bank_account_id NO.of account 321356.0 15146.0 1 1 316728.0 13232.0 1 1 310443.0 12654.0 1 1 12655.0 1 1
Upvotes: 1
Reputation: 12410
How about using MultiIndex for the count?
import pandas as pd
df = pd.DataFrame({'Lead ID': ['308148.0', '310443.0', '310443.0', '312745.0', '312745.0', '312745.0', '312745.0', '312745.0', '313082.0', '314036.0', '314559.0', '314559.0', '316728.0', '316728.0', '316728.0', '316728.0', '316728.0', '316728.0', '318465.0', '318465.0', '318465.0', '318465.0', '318465.0', '318465.0', '318465.0', '321146.0', '321146.0', '321218.0', '321356.0', '321356.0', '321356.0'],
'bank_account_id': ['12460.0', '12654.0', '12655.0', '12835.0', '12836.0', '12837.0', '12838.0', '12839.0', '13233.0', '13226.0', '13271.0', '13273.0', '13228.0', '13230.0', '13232.0', '13234.0', '13235.0', '13272.0', '13419.0', '13420.0', '13421.0', '13422.0', '13423.0', '13424.0', '13425.0', '13970.0', '13971.0', '14779.0', '15142.0', '15144.0', '15146.0'],
'NO.of account': ['1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1']})
df2 = df.set_index(["Lead ID", "bank_account_id"])
print(df2.groupby(level="Lead ID").size())
Output:
Lead ID
308148.0 1
310443.0 2
312745.0 5
313082.0 1
314036.0 1
314559.0 2
316728.0 6
318465.0 7
321146.0 2
321218.0 1
321356.0 3
dtype: int64
Upvotes: 1
Reputation: 27567
Yo can use df.groupby()
:
import pandas as pd
df = pd.DataFrame({'Lead ID': ['308148.0', '310443.0', '310443.0', '312745.0', '312745.0', '312745.0', '312745.0', '312745.0', '313082.0', '314036.0', '314559.0', '314559.0', '316728.0', '316728.0', '316728.0', '316728.0', '316728.0', '316728.0', '318465.0', '318465.0', '318465.0', '318465.0', '318465.0', '318465.0', '318465.0', '321146.0', '321146.0', '321218.0', '321356.0', '321356.0', '321356.0'],
'bank_account_id': ['12460.0', '12654.0', '12655.0', '12835.0', '12836.0', '12837.0', '12838.0', '12839.0', '13233.0', '13226.0', '13271.0', '13273.0', '13228.0', '13230.0', '13232.0', '13234.0', '13235.0', '13272.0', '13419.0', '13420.0', '13421.0', '13422.0', '13423.0', '13424.0', '13425.0', '13970.0', '13971.0', '14779.0', '15142.0', '15144.0', '15146.0'],
'NO.of account': ['1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1']})
df2 = df[df.duplicated('Lead ID', keep=False)].groupby('Lead ID')['bank_account_id'].apply(list).reset_index()
print(df2)
Output:
Lead ID bank_account_id
0 310443.0 [12654.0, 12655.0]
1 312745.0 [12835.0, 12836.0, 12837.0, 12838.0, 12839.0]
2 314559.0 [13271.0, 13273.0]
3 316728.0 [13228.0, 13230.0, 13232.0, 13234.0, 13235.0, ...
4 318465.0 [13419.0, 13420.0, 13421.0, 13422.0, 13423.0, ...
5 321146.0 [13970.0, 13971.0]
6 321356.0 [15142.0, 15144.0, 15146.0]
You can also use a for
loop to iterate through the values of your data frame with zip()
:
import pandas as pd
df = pd.DataFrame({'Lead ID': ['308148.0', '310443.0', '310443.0', '312745.0', '312745.0', '312745.0', '312745.0', '312745.0', '313082.0', '314036.0', '314559.0', '314559.0', '316728.0', '316728.0', '316728.0', '316728.0', '316728.0', '316728.0', '318465.0', '318465.0', '318465.0', '318465.0', '318465.0', '318465.0', '318465.0', '321146.0', '321146.0', '321218.0', '321356.0', '321356.0', '321356.0'],
'bank_account_id': ['12460.0', '12654.0', '12655.0', '12835.0', '12836.0', '12837.0', '12838.0', '12839.0', '13233.0', '13226.0', '13271.0', '13273.0', '13228.0', '13230.0', '13232.0', '13234.0', '13235.0', '13272.0', '13419.0', '13420.0', '13421.0', '13422.0', '13423.0', '13424.0', '13425.0', '13970.0', '13971.0', '14779.0', '15142.0', '15144.0', '15146.0'],
'NO.of account': ['1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1']})
dct = dict()
for l, b in zip(df['Lead ID'], df['bank_account_id']):
if l in dct:
dct[l].append(b)
else:
dct[l] = [b]
print(dct)
Output:
{'308148.0': ['12460.0'],
'310443.0': ['12654.0', '12655.0'],
'312745.0': ['12835.0', '12836.0', '12837.0', '12838.0', '12839.0'],
'313082.0': ['13233.0'],
'314036.0': ['13226.0'],
'314559.0': ['13271.0', '13273.0'],
'316728.0': ['13228.0', '13230.0', '13232.0', '13234.0', '13235.0', '13272.0'],
'318465.0': ['13419.0', '13420.0', '13421.0', '13422.0', '13423.0', '13424.0', '13425.0'],
'321146.0': ['13970.0', '13971.0'],
'321218.0': ['14779.0'],
'321356.0': ['15142.0', '15144.0', '15146.0']}
Upvotes: 2