Superman

Reputation: 881

Python pandas error while reading and writing csv file

I'm not a Python guy, but I have to write something like this occasionally. I wrote this code a few months back and it was serving its purpose without any errors. But today, when I needed to use the same script on some updated csv files, it started giving me errors that I can't fix on my own. Please have a look at the code below, followed by the errors.

import pandas as pd
#import xlsxwriter

data_df = pd.read_excel("New2020Snap.xlsx")

data_df['MaxDate'] = data_df.groupby(['LeadId', 'LeadStatus'])['CreatedDate'].transform('max')
data_df['MinDate'] = data_df.groupby(['LeadId', 'LeadStatus'])['CreatedDate'].transform('min')
data_df['Difference'] = pd.to_datetime(data_df['MaxDate']) - pd.to_datetime(data_df['MinDate'])

agg_df = data_df.groupby(['LeadId', 'LeadStatus', 'Email']).agg(MaxDate=('CreatedDate', 'max'),
                                                                MinDate=('CreatedDate', 'min')).reset_index()
agg_df['Difference'] = pd.to_datetime(agg_df['MaxDate']) - pd.to_datetime(agg_df['MinDate'])

#data_df.to_json(orient='records')

with pd.ExcelWriter('../out/ComputedReport.xlsx', engine='XlsxWriter') as writer:
    data_df.to_excel(writer, sheet_name='New Computed Data', index=False)
    agg_df.to_excel(writer, sheet_name='Computed Agg Data', index=False)

print(data_df)

Following are the errors I'm receiving when running the script above.

Traceback (most recent call last):
  File "C:\Users\w-s\IdeaProjects\PythonForEverybody\src\pandas_read_opps.py", line 6, in <module>
    data_df['MaxDate'] = data_df.groupby(['OpportunityID', 'OpportunityName', 'ToStage'])['CloseDate'].transform('max')
  File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\groupby\generic.py", line 511, in transform
    result = getattr(self, func)(*args, **kwargs)
  File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\groupby\groupby.py", line 1559, in max
    return self._agg_general(
  File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\groupby\groupby.py", line 1017, in _agg_general
    result = self.aggregate(lambda x: npfunc(x, axis=self.axis))
  File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\groupby\generic.py", line 255, in aggregate
    return self._python_agg_general(
  File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\groupby\groupby.py", line 1094, in _python_agg_general
    return self._python_apply_general(f, self._selected_obj)
  File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\groupby\groupby.py", line 892, in _python_apply_general
    keys, values, mutated = self.grouper.apply(f, data, self.axis)
  File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\groupby\ops.py", line 213, in apply
    res = f(group)
  File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\groupby\groupby.py", line 1062, in <lambda>
    f = lambda x: func(x, *args, **kwargs)
  File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\groupby\groupby.py", line 1017, in <lambda>
    result = self.aggregate(lambda x: npfunc(x, axis=self.axis))
  File "<__array_function__ internals>", line 5, in amax
  File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\numpy\core\fromnumeric.py", line 2705, in amax
    return _wrapreduction(a, np.maximum, 'max', axis, None, out,
  File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\numpy\core\fromnumeric.py", line 85, in _wrapreduction
    return reduction(axis=axis, out=out, **passkwargs)
  File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\generic.py", line 11468, in stat_func
    return self._reduce(
  File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\series.py", line 4248, in _reduce
    return op(delegate, skipna=skipna, **kwds)
  File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\nanops.py", line 129, in f
    result = alt(values, axis=axis, skipna=skipna, **kwds)
  File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\nanops.py", line 873, in reduction
    result = getattr(values, meth)(axis)
  File "C:\Users\w-s\AppData\Local\Programs\Python\Python39\lib\site-packages\numpy\core\_methods.py", line 39, in _amax
    return umr_maximum(a, axis, None, out, keepdims, initial, where)
TypeError: '>=' not supported between instances of 'datetime.datetime' and 'str'

Process finished with exit code 1

So basically, I was working on two separate copies of the same code with slight changes in each. The copy I was able to fix is pasted below. I made the changes suggested in the first comment under my question, by Guillaume Ansanay-Alex, and the exact line of code came from the answer which I'm going to mark as correct after this edit. The following errors were in the code.

  1. One of my csv columns contained improperly formatted date/time data (see the sketch after this list for how it can be checked).
  2. The keyword argument engine='XlsxWriter' was in the wrong case, even though it worked fine a few months back. It now works as engine='xlsxwriter', all lowercase.
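
As a rough sketch of how the badly formatted date values can be located (this is just a diagnostic I'd run, not part of the final script; the file and column names are the ones from my code, and errors='coerce' simply marks anything unparseable as NaT):

import pandas as pd

# Read the sheet without touching the date column yet
data_df = pd.read_excel("OppAvgStageDuration.xlsx")

# Mixed element types (datetime vs. plain strings) are what trigger
# the TypeError shown in the traceback above
print(data_df['CloseDate'].map(type).value_counts())

# errors='coerce' turns values that can't be parsed into NaT,
# so the offending rows can be inspected directly
parsed = pd.to_datetime(data_df['CloseDate'], errors='coerce')
print(data_df.loc[parsed.isna(), 'CloseDate'])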

So the working copy of my code is below.

import pandas as pd
#import xlsxwriter

data_df = pd.read_excel("OppAvgStageDuration.xlsx")
# convert CloseDate up front, as suggested by the first comment and the accepted answer
data_df['CloseDate'] = pd.to_datetime(data_df['CloseDate'])

data_df['MaxDate'] = data_df.groupby(['OpportunityID', 'OpportunityName', 'ToStage'])['CloseDate'].transform('max')
data_df['MinDate'] = data_df.groupby(['OpportunityID', 'OpportunityName', 'ToStage'])['CloseDate'].transform('min')
data_df['Difference'] = pd.to_datetime(data_df['MaxDate']) - pd.to_datetime(data_df['MinDate'])

agg_df = data_df.groupby(['OpportunityID', 'OpportunityName', 'ToStage']).agg(MaxDate=('CloseDate', 'max'),
                                                                              MinDate=('CloseDate', 'min')).reset_index()
agg_df['Difference'] = pd.to_datetime(agg_df['MaxDate']) - pd.to_datetime(agg_df['MinDate'])

#data_df.to_json(orient='records')

with pd.ExcelWriter('../out/ComputedReportOpps.xlsx', engine='xlsxwriter') as writer:
    data_df.to_excel(writer, sheet_name='New Computed Data', index=False)
    agg_df.to_excel(writer, sheet_name='Computed Agg Data', index=False)

print(data_df)

Upvotes: 0

Views: 119

Answers (1)

tdy

Reputation: 41427

Currently you're converting the derived MaxDate and MinDate columns with to_datetime(), but try converting the source CreatedDate column with to_datetime() from the start:

data_df = pd.read_excel("New2020Snap.xlsx")
data_df['CreatedDate'] = pd.to_datetime(data_df['CreatedDate'])

If that doesn't work, then I think it's what Guillaume commented: the column has mixed date formats.
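
In that case, one possible sketch is to parse in two passes (the column name is taken from your script; the '%d/%m/%Y' format string is only a placeholder for whatever the odd rows in your file actually look like):

s = data_df['CreatedDate']

# First pass: parse whatever matches the default/ISO-style format
parsed = pd.to_datetime(s, errors='coerce')

# Second pass: re-try only the rows that failed, with an explicit format
# ('%d/%m/%Y' is just a placeholder -- use whatever the odd rows contain)
fallback = pd.to_datetime(s[parsed.isna()], format='%d/%m/%Y', errors='coerce')

# fillna aligns on the index, so only the previously failed rows are filled
data_df['CreatedDate'] = parsed.fillna(fallback)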

Upvotes: 1
