Reputation: 535
I want to group pandas dataframe column based on a condition that if the values are with in a range of +20. Below is the dataframe
{'Name': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F'},
'ID': {0: 100, 1: 23, 2: 19, 3: 42, 4: 11, 5: 78},
'Left': {0: 70, 1: 70, 2: 70, 3: 70, 4: 66, 5: 66},
'Top': {0: 10, 1: 26, 2: 26, 3: 35, 4: 60, 5: 71}}
Here I want to group columns Left and Top. This is what I did:
df.groupby(['Top'],as_index=False).agg(lambda x: list(x))
This is the result I got :
{'Top': {0: 10, 1: 26, 2: 35, 3: 60, 4: 71},
'Name': {0: ['A'], 1: ['B', 'C'], 2: ['D'], 3: ['E'], 4: ['F']},
'ID': {0: [100], 1: [23, 19], 2: [42], 3: [11], 4: [78]},
'Left': {0: [70], 1: [70, 70], 2: [43], 3: [66], 4: [66]}}
Desired output:
{'Top': {0: [10, 26], 2: 35, 3: [60,71]},
'Name': {0: ['A', 'B', 'C'], 2: ['D'], 3: ['E', 'F']},
'ID': {0: [100, 23, 19], 2: [42], 3: [11, 78]},
'Left': {0: [70, 50, 87], 2: [43], 3: [66, 99]}}
NOTE:
An important thing to consider is that Top values 10 and 26 are in the range of 20, it forms a group. 35 should not be added to the group even though its difference between 26 and 35 are in the range of 20 because 10 and 20 are already in a group and the difference between 10(the least value in the group) and 35 is not in the range of 20.
Is there any any alternate way to solve this?
EDIT:
I have a different use-case for which the top values increase and when it moves to a new page the top value changes and starts increasing again. This goes on for different inputs. And finally I want to group by Input File Name, Page Number and group. How can I group these?
{'Input File Name': {0: 268441,
1: 268441,
2: 268441,
3: 268441,
4: 268441,
5: 268441,
6: 268441,
7: 268441,
8: 268441,
9: 268441,
10: 268441,
11: 268441,
12: 268441,
13: 268441,
14: 268441,
15: 268441,
16: 268441,
17: 268441,
18: 268441,
19: 268441,
20: 268441,
21: 268441,
22: 268441,
23: 268441,
24: 268441,
25: 268441,
26: 268441,
27: 268441,
28: 268441,
29: 268441,
30: 268441,
31: 268441,
32: 268441,
33: 268441,
34: 268441,
35: 268441,
36: 268441,
37: 268441,
38: 268441,
39: 268441},
'Page Number': {0: 1,
1: 1,
2: 1,
3: 1,
4: 1,
5: 1,
6: 1,
7: 1,
8: 1,
9: 1,
10: 1,
11: 1,
12: 1,
13: 1,
14: 1,
15: 1,
16: 1,
17: 1,
18: 1,
19: 1,
20: 2,
21: 2,
22: 2,
23: 2,
24: 2,
25: 2,
26: 2,
27: 2,
28: 2,
29: 2,
30: 2,
31: 2,
32: 2,
33: 2,
34: 2,
35: 2,
36: 2,
37: 2,
38: 2,
39: 2},
'Content': {0: '3708 Forestview Road',
1: 'AvailableForLease&Sale',
2: '1,700± SFMedicalOffice',
3: '3708ForestviewRoad',
4: 'Suite107',
5: 'Raleigh,NC27612',
6: 'BuildingDescription',
7: '22,278± SFClassAOfficeBuilding',
8: 'OnlyOneSuiteLeft toLeaseand/orPurchase',
9: '(1)1,700± SFShell',
10: 'FlexibleLeaseTerms',
11: '2Floorsw/Elevator&Stairsto2',
12: 'Level',
13: 'nd',
14: 'ClassAFinishes',
15: 'On-SitePropertyManagement',
16: 'LargeGlass Windows',
17: '5:1Parking',
18: 'Formoreinformation,contact:',
19: 'OtherTenants: PivotPhysicalTherapy,TheLundy',
20: 'LeasingDetails',
21: 'SpaceDescription',
22: 'LeaseRate',
23: 'CompetitiveNNN+$5.50TICAM',
24: 'Tenant',
25: 'Suite107:1,700± SF',
26: 'Janitorial&Electric',
27: 'Responsibilities',
28: 'ShellSpacew/TIAllowance&Architecturals',
29: 'ClassABuilding',
30: 'SalePrice',
31: '$374,000or$220PSF',
32: 'BeautifulDouble-DoorEntry',
33: '1,700',
34: '± SF',
35: 'Size',
36: 'LargeGlassWindows',
37: 'ColdDarkShellw/TIAllowance',
38: '5:1Parking',
39: 'Upfit'},
'Top': {0: 6,
1: 6,
2: 49,
3: 103,
4: 103,
5: 103,
6: 590,
7: 637,
8: 656,
9: 676,
10: 695,
11: 716,
12: 716,
13: 717,
14: 736,
15: 755,
16: 775,
17: 794,
18: 813,
19: 835,
20: 111,
21: 138,
22: 142,
23: 142,
24: 169,
25: 174,
26: 179,
27: 190,
28: 195,
29: 216,
30: 217,
31: 217,
32: 238,
33: 247,
34: 247,
35: 248,
36: 259,
37: 274,
38: 282,
39: 285}}
Upvotes: 2
Views: 173
Reputation: 765
x1=df1.iloc[0,3]+20
def function1(ss:pd.Series):
global x1
if ss.Top>x1:
x1=ss.col1
else:
ss.col1=x1
return ss
df1.assign(col1=df1.Top.add(20)).apply(function1,1).groupby(['col1'], as_index=0, group_keys=0).agg(list).drop(['col1'],1)
group ID Left Name Top
0 1 [100, 23, 19] [70, 70, 70] [A, B, C] [10, 26, 26]
1 2 [42] [70] [D] [35]
2 3 [11, 78] [66, 66] [E, F] [60, 71]
Upvotes: 0
Reputation: 2897
You can write a function to group the Top
columns first and then use groupby
on that column:
import pandas as pd
df = pd.DataFrame({'Name': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F'},
'ID': {0: 100, 1: 23, 2: 19, 3: 42, 4: 11, 5: 78},
'Left': {0: 70, 1: 70, 2: 70, 3: 70, 4: 66, 5: 66},
'Top': {0: 10, 1: 26, 2: 26, 3: 35, 4: 60, 5: 71}})
def group(l, group_range):
groups = []
current_group = []
i = 0
group_count = 1
while i < len(l):
a = l[i]
if len(current_group) == 0:
if i == len(l) - 1:
break
current_group_start = a
if a <= current_group_start + group_range:
current_group.append(group_count)
if a < current_group_start + group_range:
i += 1
else:
groups.extend(current_group)
current_group = []
group_count += 1
groups.extend(current_group)
return groups
#group(df['Top'],20) -> [1, 1, 1, 2, 3, 3]
df['group'] = group(df['Top'],20)
df.groupby(['group'],as_index=False).agg(list)
Output:
group ID Left Name Top
0 1 [100, 23, 19] [70, 70, 70] [A, B, C] [10, 26, 26]
1 2 [42] [70] [D] [35]
2 3 [11, 78] [66, 66] [E, F] [60, 71]
Upvotes: 2