Amon
Amon

Reputation: 2971

Pandas keeps repeating rows when using read_excel

I'm read_excel() to convert an xlsx spreadsheet into a dataframe. It usually works fine, but for this spreadsheet when I import it and try to list all the values with df["Date"] it seems to repeat every date many, many times. Any idea why this is?

    df = pd.read_excel("data.xlsx", sheet_name="Data")
    dates = df['Date']

The result:

    0        2014-12-22
    1        2014-12-22
    2        2014-12-22
    3        2014-12-22
    4        2014-12-22
                ...
    419532   2020-01-10
    419533   2020-01-10
    419534   2020-01-10
    419535   2020-01-10
    419536   2020-01-10

edit: A bit of the spreadsheet:

enter image description here As you can see the dates are repeated, they definitely do not look like that in my spreadsheet. Any idea why? Thanks

Upvotes: 1

Views: 388

Answers (1)

Cohan
Cohan

Reputation: 4564

pd.read_excel() will read the entire sheet unless you constrain it to a limited range. Excel has a number of functions to alter the display of the sheet without changing what data exists in the sheet (filtering, hiding rows, etc.). So the extra rows are likey data that exists in the sheet but have been hidden from display. abs

Some tell tale signs are if a filter button has the filtered symbol in it, blue row numbers, and hidden rows as seen below.

enter image description here

Check to make sure that you can see all the data so you know what is in the worksheet and not what is just visible.

To get to the same view that you have, you can filter your dataframe with conditions.

df = df[df['city'] == 'CALGARY']

or multiple conditions:

df = df[(df['city'] == 'CALGARY') & (df['utility'] == 'Suncor')]

Upvotes: 1

Related Questions