Reputation: 1317
I've written some code that takes an input list of "order" objects that looks like this...
**Company, theDate, Product, ProductReceived**
Apple, 2020-10-01, Subscription, 0
Apple, 2020-10-01, Trial, 0
Apple, 2020-11-01, Subscription, 0
Apple, 2020-11-01, Trial, 1
Apple, 2020-12-01, Subscription, 1
Apple, 2020-12-01, Trial, 0
Apple, 2021-01-01, Subscription, 1
Apple, 2021-01-01, Trial, 1
Apple, 2021-02-01, Subscription, 1
Apple, 2021-02-01, Trial, 1
Apple, 2021-03-01, Subscription, 0
Apple, 2021-03-01, Trial, 1
And turns it into a simple String output that looks like this...
[ ] 2020.10 - 2020.12
[x] 2021.01 - 2021.02
[ ] 2021.03
The idea being that it's grouped all dates in continuous ranges based on whether ALL products have been received for a given date.
The code works but feels extremely clunky and I have a suspicion that python/pandas probably offers near out-the-box solutions to a task like this. Would really appreciate any thoughts on the most elegant solution since this code (which is simplified from my real world example) is already becoming burdensome to extend. Here's a full working example...
class orderObject:
def __init__(self,company, theDate, product, productReceived):
self.company = company
self.theDate = theDate
self.product = product
self.productReceived = productReceived
orderList = [orderObject('Apple','2020-10-01','Subscription',0)
, orderObject('Apple','2020-10-01','Trial',0)
, orderObject('Apple','2020-11-01','Subscription',0)
, orderObject('Apple','2020-11-01','Trial',1)
, orderObject('Apple','2020-12-01','Subscription',1)
, orderObject('Apple','2020-12-01','Trial',0)
, orderObject('Apple','2021-01-01','Subscription',1)
, orderObject('Apple','2021-01-01','Trial',1)
, orderObject('Apple','2021-02-01','Subscription',1)
, orderObject('Apple','2021-02-01','Trial',1)
, orderObject('Apple','2021-03-01','Subscription',0)
, orderObject('Apple','2021-03-01','Trial',1)
]
dateSet = {}
for o in orderList:
dateSet[o.theDate] = 1 #loop through once to assume every date is complete
for o in orderList: #loop through a second time, and take any incomplete date/product as a fail (i.e. all products must be complete for the date to be complete)
if o.productReceived < 1:
dateSet[o.theDate] = 0
#now dateSet contains references to every date and whether we have received products for ALL products at that date, e.g. attr: 2020-10-01, value: 0 etc
def updateCompanyOrderGrouped(lcGroup):
indexText = startGroup[:-3].replace("-",".")
if (startGroup!=endGroup):
indexText = indexText + " - " + endGroup[:-3].replace("-",".")
if (lastValue==0):
lcGroup += "\n[ ] " + indexText
else:
lcGroup += "\n[x] " + indexText
return lcGroup
companyOrderGrouped = "" #string to store grouping result
lastValue = -1 #start with a "last value" that won't match any current value
startGroup = endGroup = "NA" #start with startGroup and endGroup that won't match any current groups
for attr, value in dateSet.items():
if value != lastValue:
if startGroup != "NA":
#we just ended a grouping because the value changed...
companyOrderGrouped = updateCompanyOrderGrouped(companyOrderGrouped)
startGroup = attr
#whatever happens, update the endGroup + value to keep track of the last record
endGroup = attr
lastValue = value
print(updateCompanyOrderGrouped(companyOrderGrouped).lstrip('\n')) #finish with last call to updateCompanyOrderGrouped to end the final group
Upvotes: 1
Views: 243
Reputation: 31329
Something like this is probably what you're after:
from pandas import DataFrame
order_columns = ['company', 'date', 'product', 'received']
order_data = [
['Apple', '2020-10-01', 'Subscription', 0],
['Apple', '2020-10-01', 'Trial', 0],
['Apple', '2020-11-01', 'Subscription', 0],
['Apple', '2020-11-01', 'Trial', 1],
['Apple', '2020-12-01', 'Subscription', 1],
['Apple', '2020-12-01', 'Trial', 0],
['Apple', '2021-01-01', 'Subscription', 1],
['Apple', '2021-01-01', 'Trial', 1],
['Apple', '2021-02-01', 'Subscription', 1],
['Apple', '2021-02-01', 'Trial', 1],
['Apple', '2021-03-01', 'Subscription', 0],
['Apple', '2021-03-01', 'Trial', 1]
]
df = DataFrame(order_data, columns=order_columns)
# create an extra column all_received that is 0 or 1 for all products received on a date
df['all_received'] = df.groupby(by='date')['received'].transform('all').astype(int)
# grouping that uses a temporary series changing value every time all_received changes
grouping = df.groupby([(df.all_received != df.all_received.shift()).cumsum()])
# from that grouping, the value and date of every first + date of every last
result = [(last.all_received, first.date[:-3].replace("-", "."), last.date[:-3].replace("-", "."))
for first, last in zip(grouping.nth(0).itertuples(), grouping.nth(-1).itertuples())]
# print similar to your format:
for value, start, end in result:
print(f'[{"x" if value else " "}] {start}{" - " + end if end != start else ""}')
So, apart from the definition of the data, you're right: it's three lines of code, although I'd argue your solution is a bit more readable. It takes a while to read these dense lines of pandas
code and realise what is going on.
Note that I renamed some of the variable to be all lowercase with underscores, which is the recommended naming convention for Python.
Output:
[ ] 2020.10 - 2020.12
[x] 2021.01 - 2021.02
[ ] 2021.03
Since you commented that this was a bit beyond your pandas
understanding, here's some background:
df['all_received'] = df.groupby(by='date')['received'].transform('all').astype(int)
The .groupby
creates a group of records in df
by 'date'
, i.e. it groups all the records together that have the same date. It then selects the 'received'
column from the grouping and applies .transform('all')
to it, which creates a series which is either True
if all the values in a group are truthy (like 1
) or False
otherwise (i.e. if one or more are 0
). Finally, the .astype(int)
then turns those boolean values into integers (0
or 1
) again. The resulting series is assigned to a new all_received
column, which works as it still has the same number of records.
grouping = df.groupby([(df.all_received != df.all_received.shift()).cumsum()])
The inner bit df.all_received != df.all_received.shift()
is 1
if all_received
is the same value on the the current record as on the next (by comparing it with the same column, shifted by 1 position) or 0
otherwise. The resulting series is then cumulatively summed (i.e. [0, 1, 1, 0, 1]
would become [0, 1, 2, 2, 3]
). That means the resulting series corresponds with the grouping of 1
's and 0
's in all_received
without changing the order (like a .groupby
would if you just applied it to all_received
). A grouping of the original df
by that temporary series is then created, which is the grouping you're after.
result = [(last.all_received, first.date[:-3].replace("-", "."), last.date[:-3].replace("-", "."))
for first, last in zip(grouping.nth(0).itertuples(), grouping.nth(-1).itertuples())]
In this final bit, that grouping is used twice (which is why it was created separately). grouping.nth(0).itertuples()
will get you the first record of every group, as a tuple of values. Similarly, grouping.nth(-1).itertuples()
gets you the last record of every group. By zipping these iterables together, you get pairs of the first and last record of every group - exactly what's needed to create the output. The rest is just a normal list comprehension taking the formatted date of both first
and last
and the all_received
of last
(could have also used first
, they're in the same group, so it will be the same).
And of course finally the print statement produces the output in the format you required, although the date formatting has already happened in the previous step.
Upvotes: 1
Reputation: 223
Another option, using pandas and maintaining some readabliity for future changes/extensions:
df = pd.read_csv('pandas-groupby-date-ranges.csv') # original data
df['theDate'] = pd.to_datetime(df['theDate'], infer_datetime_format=True)
df['year'] = df['theDate'].dt.year
df['month'] = df['theDate'].dt.month
aggregate_df = pd.DataFrame()
for name, group in df.groupby(['year', 'month']):
group['all_received'] = group['ProductReceived'].all()
aggregate_df = pd.concat([aggregate_df, group])
aggregate_df['group'] = (
aggregate_df['all_received'].ne(aggregate_df['all_received'].shift(1)).cumsum()
)
for name, group in aggregate_df.groupby('group'):
group_min = group['theDate'].min()
group_max = group['theDate'].max()
# output to desired format
x = '[x] ' if group['all_received'].iloc[0] == True else '[ ] '
if group_min != group_max:
print(
x
+ str(group_min.year)
+ "."
+ str(group_min.month)
+ ' - '
+ str(group_max.year)
+ "."
+ str(group_max.month)
)
else:
print(x + str(group_min.year) + "." + str(group_min.month))
Output:
[ ] 2020.10 - 2020.12
[x] 2021.1 - 2021.2
[ ] 2021.3
Upvotes: 1