Reputation: 81
In the line of the following question, which doesn't seem to have an answer yet.
Read dates from excel to Pandas Dataframe
On European machines Pandas has a confusing bug while parsing dates from an Excelsheet with a european format (dd-mm-yyyy). Dates with a day number from 1-12 will automatically be converted to the american standard (mm-dd-yyyy) while dates with a day number > 12 are parsed in a European way (dd-mm-yyyy). This obviously leads to problems.
There is always a solution to post process the dates and switch them around if both 'day' and 'month' are < 13, but that doesn't seem to be the way it suppose to work. Has anyone found a better solution? Thanks in advance!
python: '3.4.3 |Anaconda 2.1.0 (x86_64)| (default, Oct 20 2015, 14:27:51) \n[GCC 4.2.1 (Apple Inc. build 5577)]
Pandas: '0.17.0'
EDIT 17 nov 2015
Found a workaround/solution myself: included dayfirst=True with to_datetime()
It still seems like a bug to me. I added a simplified version of my code to give some more context. The script reads an excelsheet with personal data and converts to create a new sheet that can be used for server upload. The input can vary quite a lot, but I simplified the example.
Added my solution in the code and let it make 2 date outputs: one with and one without dayfirst=True
Ran the code on two different excel sheets. One had no problem at all (the xlsx file, example 2) and the other (xls, example 1) had a difference between the columns. It seems like pandas correctly recognizes day and month, but has difficulty creating a string from a date and mixes up the order automatically in ipython output.
Final list for xls file, see the problem with Name 4
Final list for xlsx file, no problem with name 3
# Module for test list
path = "xxxx"
namefile = "testlist 1.xls"
#namefile = "testlist 2.xlsx"
schoolnaam = 'schoolname'
BRIN = 'XXXX'
meetperiode = 'MPX'
meetjaar = '20xx/20xx'
os.chdir(path)
df = pd.DataFrame()
df = pd.read_excel(namefile,0, header = None, parse_dates = True)
df1 = pd.DataFrame()
df1 = df
df1.columns = ['Leerlingnummer', 'Achternaam', 'Geslacht', 'Blank', 'Leerjaar', 'Gebdatum']
df1 = df1[['Leerlingnummer', 'Achternaam', 'Geslacht', 'Gebdatum']]
# Sheet Leerling
df1.loc[df1['Leerlingnummer'].str.contains('Groep|/|A|B|C|D|E|F|G|H|I|J', na=False), 'Naam groep'] = df1.Leerlingnummer
df1['Naam groep'] = df1['Naam groep'].ffill()
df1.dropna(thresh=5, inplace = True)
df1['Achternaam'] = df1['Achternaam'].str.strip()
df1['Geslacht'] = df1['Geslacht'].str.strip().str.upper()
df1['Naam groep'] = df1['Naam groep'].str.strip()
df1['Voornaam'] = np.nan
df1['Tussenvoegsel'] = np.nan
df1['Geboortedatum']= pd.to_datetime(df1.Gebdatum).apply(lambda x: x.strftime('%d-%m-%Y'))
df1['Geboortedatum2']= pd.to_datetime(df1.Gebdatum, dayfirst=True).apply(lambda x: x.strftime('%d-%m-%Y'))
dfLeerling = df1[['Achternaam','Voornaam','Tussenvoegsel','Geslacht','Geboortedatum','Geboortedatum2','Naam groep']]
# Sheet Groep
gb = df1.groupby('Naam groep')
klaslijst = list(gb.groups)
klaslijst.sort()
dfGroep = pd.DataFrame(data = klaslijst, columns=['Naam groep'])
dfGroep['Lesjaar'] = meetjaar
dfGroep['Naam leraar'] = np.nan
dfGroep['Opmerkingen'] = np.nan
# Sheet School
dfSchool = pd.DataFrame({'BRIN': BRIN, 'Naam school': schoolnaam, 'Adres':[np.nan], 'Postcode':[np.nan], 'Plaats':[np.nan],
'Telefoon':[np.nan], 'Email':[np.nan], 'Website':[np.nan]})
dfSchool = dfSchool[['BRIN','Naam school','Adres','Postcode','Plaats','Telefoon','Email','Website']]
# Writer
namefile2 = 'Final list %s %s.xlsx' % (schoolnaam, meetperiode)
writer = pd.ExcelWriter(namefile2)
dfSchool.to_excel(writer, 'School', index=False)
dfGroep.to_excel(writer, 'Groep', index=False)
dfLeerling.to_excel(writer, 'Leerling', index=False)
writer.save()
dfLeerling.head()
Upvotes: 8
Views: 1548
Reputation: 13
When this happens I create the dataframe already forcing the type to str for the date columns so it doesn't get anything interpreted
dtype={'x':'str','y':'str'}
After that you can use the to_datetime() method pointing out the format you want
format='%d/%m/%Y'
Upvotes: 1