grumpypig
grumpypig

Reputation: 59

Converting to Excel "Date" format (within Excel file) using python and pandas from another date format from html table

I am new to python and exploring to get data from excel using it and found pandas library to get data

I need to get the rates from a HTML table on a website. Table from which the data has to be read Then dump it in an excel file. I am using Python I have used the following code

import pandas as pd
from datetime import datetime
import lxml as lx
import openpyxl as oxl
url = "https://www.example.com"

tables = pd.read_html(url)
table = tables[0]
table.to_excel('output.xlsx')

The dates are in dd mmm yyyy format in the 'Effective Date' column

I would like to convert them to the dd/mm/yyyy format

I used the following code to convert the table

['Effective Date'] = pd.to_datetime(table['Effective Date'],
                                        infer_datetime_format=False, format='%d/%m/%Y', errors='ignore')

but it fails to convert the dates in the column. Could someone head me in some proper direction please.

Here is the complete code

import pandas as pd
import html5lib
import datetime
import locale
import pytz
import lxml as lx
import openpyxl as oxl

url = "https://www.rba.gov.au/statistics/cash-rate/"

tables = pd.read_html(url)

table = tables[0]

table['Effective Date'] = pd.to_datetime(table['Effective Date'],
                                        infer_datetime_format=False, format='%d/%m/%Y', errors='ignore')

table.to_excel('rates.xlsx')

Upvotes: 6

Views: 39345

Answers (1)

David Erickson
David Erickson

Reputation: 16683

You need to use pd.ExcelWriter to create a writer object, so that you can change to Date format WITHIN Excel; however, this problem has a couple of different aspects to it:

  1. You have non-date values in your date column, including "Legend:", "Cash rate decreased", "Cash Rate increased", and "Cash rate unchanged".
  2. As mentioned in the comments, you must pass format='%d %b %Y' to pd.to_datetime() as that is the Date format you are converting FROM.
  3. You must pass errors='coerce' in order to return NaT for those that don't meet the specified format
  4. For the pd.to_datetime() line of code, you must add .dt.date at the end, because we use a date_format parameter and not a datetime_format parameter in creating the writer object later on. However, you could also exclude dt.date and change the format of the datetime_format parameter.
  5. Then, do table = table.dropna() to drop rows with any columns with NaT
  6. Pandas does not change the Date format WITHIN Excel. If you want to do that, then you should use openpyxl and create a writer object and pass the date_format. In case someone says this, you CANNOT simply do: pd.to_datetime(table['Effective Date'], format='%d %b %Y', errors='coerce').dt.strftime('%m/%d/%y') or .dt.strftime('%d/%m/%y'), because that creates a "General" date format in EXCEL.
  7. Output is ugly if you do not widen your columns, so I've included code for that as well. Please note that I am on a USA locale, so passing d/m/yyyy creates a "Custom" format in Excel.

NOTE: In my code, I have to pass m/d/yyyy in order for a "Date" format to appear in EXCEL. You can simply change to date_format='d/m/yyyy' since my computer has a different locale than you (USA) that Excel utilizes for "Date" format.

Source + More on this topic:


import pandas as pd
import html5lib
import datetime
import locale
import pytz
import lxml as lx
import openpyxl as oxl

url = "https://www.rba.gov.au/statistics/cash-rate/"

tables = pd.read_html(url)

table = tables[0]

table['Effective Date'] = pd.to_datetime(table['Effective Date'], format='%d %b %Y', errors='coerce').dt.date
table = table.dropna()
table.to_excel('rates.xlsx')

writer = pd.ExcelWriter("rates.xlsx",
                        engine='xlsxwriter',
                        date_format='m/d/yyyy')

# Convert the dataframe to an XlsxWriter Excel object.
table.to_excel(writer, sheet_name='Sheet1')

# Get the xlsxwriter workbook and worksheet objects in order to set the column
# widths, to make the dates clearer.
workbook  = writer.book
worksheet = writer.sheets['Sheet1']
worksheet.set_column('B:E', 20)

# Close the Pandas Excel writer and output the Excel file.
writer.save()

enter image description here

Upvotes: 8

Related Questions