d3wannabe
d3wannabe

Reputation: 1317

Efficient Python grouping of object list into date ranges based on property

I've written some code that takes an input list of "order" objects that looks like this...

**Company, theDate, Product, ProductReceived**
Apple, 2020-10-01, Subscription, 0
Apple, 2020-10-01, Trial, 0
Apple, 2020-11-01, Subscription, 0
Apple, 2020-11-01, Trial, 1
Apple, 2020-12-01, Subscription, 1
Apple, 2020-12-01, Trial, 0
Apple, 2021-01-01, Subscription, 1
Apple, 2021-01-01, Trial, 1
Apple, 2021-02-01, Subscription, 1
Apple, 2021-02-01, Trial, 1
Apple, 2021-03-01, Subscription, 0
Apple, 2021-03-01, Trial, 1

And turns it into a simple String output that looks like this...

[ ] 2020.10 - 2020.12
[x] 2021.01 - 2021.02
[ ] 2021.03

The idea being that it's grouped all dates in continuous ranges based on whether ALL products have been received for a given date.

The code works but feels extremely clunky and I have a suspicion that python/pandas probably offers near out-the-box solutions to a task like this. Would really appreciate any thoughts on the most elegant solution since this code (which is simplified from my real world example) is already becoming burdensome to extend. Here's a full working example...

class orderObject:
    def __init__(self,company, theDate, product, productReceived):
        self.company = company
        self.theDate = theDate  
        self.product = product  
        self.productReceived = productReceived

orderList = [orderObject('Apple','2020-10-01','Subscription',0)
              , orderObject('Apple','2020-10-01','Trial',0)
              , orderObject('Apple','2020-11-01','Subscription',0)
              , orderObject('Apple','2020-11-01','Trial',1)
              , orderObject('Apple','2020-12-01','Subscription',1)
              , orderObject('Apple','2020-12-01','Trial',0)
              , orderObject('Apple','2021-01-01','Subscription',1)
              , orderObject('Apple','2021-01-01','Trial',1)
              , orderObject('Apple','2021-02-01','Subscription',1)
              , orderObject('Apple','2021-02-01','Trial',1)
              , orderObject('Apple','2021-03-01','Subscription',0)
              , orderObject('Apple','2021-03-01','Trial',1)    
             ]

dateSet = {}
for o in orderList:
    dateSet[o.theDate] = 1 #loop through once to assume every date is complete
   
for o in orderList:  #loop through a second time, and take any incomplete date/product as a fail (i.e. all products must be complete for the date to be complete)
    if o.productReceived < 1:
        dateSet[o.theDate] = 0            
#now dateSet contains references to every date and whether we have received products for ALL products at that date, e.g. attr: 2020-10-01, value: 0 etc

def updateCompanyOrderGrouped(lcGroup):
    indexText = startGroup[:-3].replace("-",".")
    if (startGroup!=endGroup):
        indexText = indexText + " - " + endGroup[:-3].replace("-",".")
    if (lastValue==0):
        lcGroup += "\n[ ] " + indexText
    else:
        lcGroup += "\n[x] " + indexText
    return lcGroup

companyOrderGrouped = "" #string to store grouping result
lastValue = -1 #start with a "last value" that won't match any current value
startGroup = endGroup = "NA" #start with startGroup and endGroup that won't match any current groups

for attr, value in dateSet.items():
    if value != lastValue:
        if startGroup != "NA":
            #we just ended a grouping because the value changed...
            companyOrderGrouped = updateCompanyOrderGrouped(companyOrderGrouped)
        startGroup = attr
    #whatever happens, update the endGroup + value to keep track of the last record
    endGroup = attr
    lastValue = value

print(updateCompanyOrderGrouped(companyOrderGrouped).lstrip('\n')) #finish with last call to updateCompanyOrderGrouped to end the final group

Upvotes: 1

Views: 243

Answers (2)

Grismar
Grismar

Reputation: 31329

Something like this is probably what you're after:

from pandas import DataFrame

order_columns = ['company', 'date', 'product', 'received']
order_data = [
    ['Apple', '2020-10-01', 'Subscription', 0],
    ['Apple', '2020-10-01', 'Trial', 0],
    ['Apple', '2020-11-01', 'Subscription', 0],
    ['Apple', '2020-11-01', 'Trial', 1],
    ['Apple', '2020-12-01', 'Subscription', 1],
    ['Apple', '2020-12-01', 'Trial', 0],
    ['Apple', '2021-01-01', 'Subscription', 1],
    ['Apple', '2021-01-01', 'Trial', 1],
    ['Apple', '2021-02-01', 'Subscription', 1],
    ['Apple', '2021-02-01', 'Trial', 1],
    ['Apple', '2021-03-01', 'Subscription', 0],
    ['Apple', '2021-03-01', 'Trial', 1]
]
df = DataFrame(order_data, columns=order_columns)

# create an extra column all_received that is 0 or 1 for all products received on a date
df['all_received'] = df.groupby(by='date')['received'].transform('all').astype(int)
# grouping that uses a temporary series changing value every time all_received changes
grouping = df.groupby([(df.all_received != df.all_received.shift()).cumsum()])
# from that grouping, the value and date of every first + date of every last
result = [(last.all_received, first.date[:-3].replace("-", "."), last.date[:-3].replace("-", "."))
          for first, last in zip(grouping.nth(0).itertuples(), grouping.nth(-1).itertuples())]

# print similar to your format:
for value, start, end in result:
    print(f'[{"x" if value else " "}] {start}{" - " + end if end != start else ""}')

So, apart from the definition of the data, you're right: it's three lines of code, although I'd argue your solution is a bit more readable. It takes a while to read these dense lines of pandas code and realise what is going on.

Note that I renamed some of the variable to be all lowercase with underscores, which is the recommended naming convention for Python.

Output:

[ ] 2020.10 - 2020.12
[x] 2021.01 - 2021.02
[ ] 2021.03

Since you commented that this was a bit beyond your pandas understanding, here's some background:

df['all_received'] = df.groupby(by='date')['received'].transform('all').astype(int)

The .groupby creates a group of records in df by 'date', i.e. it groups all the records together that have the same date. It then selects the 'received' column from the grouping and applies .transform('all') to it, which creates a series which is either True if all the values in a group are truthy (like 1) or False otherwise (i.e. if one or more are 0). Finally, the .astype(int) then turns those boolean values into integers (0 or 1) again. The resulting series is assigned to a new all_received column, which works as it still has the same number of records.

grouping = df.groupby([(df.all_received != df.all_received.shift()).cumsum()])

The inner bit df.all_received != df.all_received.shift() is 1 if all_received is the same value on the the current record as on the next (by comparing it with the same column, shifted by 1 position) or 0 otherwise. The resulting series is then cumulatively summed (i.e. [0, 1, 1, 0, 1] would become [0, 1, 2, 2, 3]). That means the resulting series corresponds with the grouping of 1's and 0's in all_received without changing the order (like a .groupby would if you just applied it to all_received). A grouping of the original df by that temporary series is then created, which is the grouping you're after.

result = [(last.all_received, first.date[:-3].replace("-", "."), last.date[:-3].replace("-", "."))
          for first, last in zip(grouping.nth(0).itertuples(), grouping.nth(-1).itertuples())]

In this final bit, that grouping is used twice (which is why it was created separately). grouping.nth(0).itertuples() will get you the first record of every group, as a tuple of values. Similarly, grouping.nth(-1).itertuples() gets you the last record of every group. By zipping these iterables together, you get pairs of the first and last record of every group - exactly what's needed to create the output. The rest is just a normal list comprehension taking the formatted date of both first and last and the all_received of last (could have also used first, they're in the same group, so it will be the same).

And of course finally the print statement produces the output in the format you required, although the date formatting has already happened in the previous step.

Upvotes: 1

c_48
c_48

Reputation: 223

Another option, using pandas and maintaining some readabliity for future changes/extensions:

df = pd.read_csv('pandas-groupby-date-ranges.csv')  # original data

df['theDate'] = pd.to_datetime(df['theDate'], infer_datetime_format=True)
df['year'] = df['theDate'].dt.year
df['month'] = df['theDate'].dt.month

aggregate_df = pd.DataFrame()
for name, group in df.groupby(['year', 'month']):
    group['all_received'] = group['ProductReceived'].all()
    aggregate_df = pd.concat([aggregate_df, group])

aggregate_df['group'] = (
    aggregate_df['all_received'].ne(aggregate_df['all_received'].shift(1)).cumsum()
)

for name, group in aggregate_df.groupby('group'):
    group_min = group['theDate'].min()
    group_max = group['theDate'].max()

    # output to desired format
    x = '[x] ' if group['all_received'].iloc[0] == True else '[ ] '
    if group_min != group_max:
        print(
            x
            + str(group_min.year)
            + "."
            + str(group_min.month)
            + ' - '
            + str(group_max.year)
            + "."
            + str(group_max.month)
        )
    else:
        print(x + str(group_min.year) + "." + str(group_min.month))

Output:

[ ] 2020.10 - 2020.12
[x] 2021.1 - 2021.2
[ ] 2021.3

Upvotes: 1

Related Questions