Reputation: 881
I'm not python guy but I have to write something like this occasionally. I wrote this code few months back and it was serving the purpose without any errors. But today when I need to use this same script for some updated csv files. It is giving me some errors which I can't fix by my own. Please have a look at the code below following by the errors.
import pandas as pd
#import xlsxwriter
data_df = pd.read_excel("New2020Snap.xlsx")
data_df['MaxDate'] = data_df.groupby(['LeadId', 'LeadStatus'])['CreatedDate'].transform('max')
data_df['MinDate'] = data_df.groupby(['LeadId', 'LeadStatus'])['CreatedDate'].transform('min')
data_df['Difference'] = pd.to_datetime(data_df['MaxDate']) - pd.to_datetime(data_df['MinDate'])
agg_df = data_df.groupby(['LeadId', 'LeadStatus', 'Email']).agg(MaxDate=('CreatedDate', 'max'),
MinDate=('CreatedDate', 'min')).reset_index()
agg_df['Difference'] = pd.to_datetime(agg_df['MaxDate']) - pd.to_datetime(agg_df['MinDate'])
#data_df.to_json(orient='records')
with pd.ExcelWriter('../out/ComputedReport.xlsx', engine='XlsxWriter') as writer:
data_df.to_excel(writer, sheet_name='New Computed Data', index=False)
agg_df.to_excel(writer, sheet_name='Computed Agg Data', index=False)
print(data_df)
Following are the errors I'm receving from running this above script.
Traceback (most recent call last):
File "C:\Users\w-s\IdeaProjects\PythonForEverybody\src\pandas_read_opps.py", line 6, in <module>
data_df['MaxDate'] = data_df.groupby(['OpportunityID', 'OpportunityName', 'ToStage'])['CloseDate'].transform('max')
File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\groupby\generic.py", line 511, in transform
result = getattr(self, func)(*args, **kwargs)
File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\groupby\groupby.py", line 1559, in max
return self._agg_general(
File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\groupby\groupby.py", line 1017, in _agg_general
result = self.aggregate(lambda x: npfunc(x, axis=self.axis))
File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\groupby\generic.py", line 255, in aggregate
return self._python_agg_general(
File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\groupby\groupby.py", line 1094, in _python_agg_general
return self._python_apply_general(f, self._selected_obj)
File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\groupby\groupby.py", line 892, in _python_apply_general
keys, values, mutated = self.grouper.apply(f, data, self.axis)
File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\groupby\ops.py", line 213, in apply
res = f(group)
File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\groupby\groupby.py", line 1062, in <lambda>
f = lambda x: func(x, *args, **kwargs)
File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\groupby\groupby.py", line 1017, in <lambda>
result = self.aggregate(lambda x: npfunc(x, axis=self.axis))
File "<__array_function__ internals>", line 5, in amax
File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\numpy\core\fromnumeric.py", line 2705, in amax
return _wrapreduction(a, np.maximum, 'max', axis, None, out,
File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\numpy\core\fromnumeric.py", line 85, in _wrapreduction
return reduction(axis=axis, out=out, **passkwargs)
File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\generic.py", line 11468, in stat_func
return self._reduce(
File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\series.py", line 4248, in _reduce
return op(delegate, skipna=skipna, **kwds)
File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\nanops.py", line 129, in f
result = alt(values, axis=axis, skipna=skipna, **kwds)
File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\nanops.py", line 873, in reduction
result = getattr(values, meth)(axis)
File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\numpy\core\_methods.py", line 39, in _amax
return umr_maximum(a, axis, None, out, keepdims, initial, where)
TypeError: '>=' not supported between instances of 'datetime.datetime' and 'str'
Process finished with exit code 1
So basically I was working on two separate copies of the same code with slight changes in each of them. However the one which I was able to fix is pasted below. And I made the changes as suggested by the first comment under my question, by Sir Guillaume Ansanay-Alex. And the exact line of code was suggested by the answer which I'm going to mark as correct after this edit. So the following errors was there in the code.
So the working copy for my code is as below.
import pandas as pd
#import xlsxwriter
data_df = pd.read_excel("OppAvgStageDuration.xlsx")
#suggested by the first comment and answered by the accepted one.
data_df['CloseDate'] = pd.to_datetime(data_df['CloseDate'])
data_df['MaxDate'] = data_df.groupby(['OpportunityID', 'OpportunityName', 'ToStage'])['CloseDate'].transform('max')
data_df['MinDate'] = data_df.groupby(['OpportunityID', 'OpportunityName', 'ToStage'])['CloseDate'].transform('min')
data_df['Difference'] = pd.to_datetime(data_df['MaxDate']) - pd.to_datetime(data_df['MinDate'])
agg_df = data_df.groupby(['OpportunityID', 'OpportunityName', 'ToStage']).agg(MaxDate=('CloseDate', 'max'),
MinDate=('CloseDate', 'min')).reset_index()
agg_df['Difference'] = pd.to_datetime(agg_df['MaxDate']) - pd.to_datetime(agg_df['MinDate'])
#data_df.to_json(orient='records')
with pd.ExcelWriter('../out/ComputedReportOpps.xlsx', engine='xlsxwriter') as writer:
data_df.to_excel(writer, sheet_name='New Computed Data', index=False)
agg_df.to_excel(writer, sheet_name='Computed Agg Data', index=False)
print(data_df)
Upvotes: 0
Views: 119
Reputation: 41427
Currently you're converting the derived MaxDate
and MinDate
columns to_datetime()
, but try converting the source CreatedDate
column to_datetime()
from the start:
data_df = pd.read_excel("New2020Snap.xlsx")
data_df['CreatedDate'] = pd.to_datetime(data_df['CreatedDate'])
If that doesn't work then I think it's what Guillaume commented, that it has mixed format.
Upvotes: 1