thePandasFriend
thePandasFriend

Reputation: 113

How to start reading in an excel file at a certain row based on a condition in Pandas

I read in excel files that are normally formatted like this below:

colA colB
   0    0
   1    1

and I can just write something like df = pd.read_excel(filename, skiprows=0)

which skips the column headers and ingests the data. However sometimes my data comes in as

some random text in the cells above
colA colB
   0    0
   1    1

where I would need to either delete that extra row manually then shift everything up so that the first row is made up of the column headers. Is there an elegant way to start the excel read at whatever row number colA is found so we skip any unnecessary entries or text above the colA and colB headers?

Upvotes: 1

Views: 1529

Answers (2)

Alex.Kh
Alex.Kh

Reputation: 612

I don't quite understand your problem. It looks like you know about skip_rows. You could just pass a list of row number to do that.

skiprows : list-like, int or callable, optional
        Line numbers to skip (0-indexed) or number of lines to skip (int)
        at the start of the file.

For example,

rows_to_skip=[0,1,2] #skip first 3 rows of the file
df = pd.read_excel(filename, skiprows=rows_to_skip)

There is also a way to slightly simplify the process. Say, you don't know the exact line where your column headers are. You can use grep to obtain this number in a terminal and just get rid of all rows before that.

For example,grep -n 'colA' filename will return the line where that information is found along with a line number.You could easily than make a list to skip all preceding lines like this rows_to_skip=list(range(line_number)). Not the best possible solution(memory-wise due to list), but it should also work here.

Upvotes: 2

Toby Petty
Toby Petty

Reputation: 4660

Assuming you know the first column name (i.e. colA in your example), and that this value will be present somewhere in the first column of data:

if df.columns[0] != "colA":  # Check first if column name is incorrect.
    # Get the first column of data:
    first_col = df[df.columns[0]]
    # Identify the row index where the value equals the column name:
    header_row_index = first_col.loc[first_col == "colA"].index[0]
    # Grab the column names:
    column_names = df.loc[header_row_index]
    # Reset the df to start below the new header row, and rename the columns:
    df = df.loc[header_row_index+1:, :]
    df.columns = column_names

Upvotes: 2

Related Questions