Pandas keeps repeating rows when using read_excel

Question

I'm read_excel() to convert an xlsx spreadsheet into a dataframe. It usually works fine, but for this spreadsheet when I import it and try to list all the values with df["Date"] it seems to repeat every date many, many times. Any idea why this is?

    df = pd.read_excel("data.xlsx", sheet_name="Data")
    dates = df['Date']

The result:

    0        2014-12-22
    1        2014-12-22
    2        2014-12-22
    3        2014-12-22
    4        2014-12-22
                ...
    419532   2020-01-10
    419533   2020-01-10
    419534   2020-01-10
    419535   2020-01-10
    419536   2020-01-10

edit: A bit of the spreadsheet:

As you can see the dates are repeated, they definitely do not look like that in my spreadsheet. Any idea why? Thanks

Cohan · Accepted Answer

pd.read_excel() will read the entire sheet unless you constrain it to a limited range. Excel has a number of functions to alter the display of the sheet without changing what data exists in the sheet (filtering, hiding rows, etc.). So the extra rows are likey data that exists in the sheet but have been hidden from display. abs

Some tell tale signs are if a filter button has the filtered symbol in it, blue row numbers, and hidden rows as seen below.

Check to make sure that you can see all the data so you know what is in the worksheet and not what is just visible.

To get to the same view that you have, you can filter your dataframe with conditions.

df = df[df['city'] == 'CALGARY']

or multiple conditions:

df = df[(df['city'] == 'CALGARY') & (df['utility'] == 'Suncor')]

Pandas keeps repeating rows when using read_excel

Answers (1)

Related Questions