Reputation: 113
I read in excel files that are normally formatted like this below:
colA colB
0 0
1 1
and I can just write something like df = pd.read_excel(filename, skiprows=0)
which skips the column headers and ingests the data. However sometimes my data comes in as
some random text in the cells above
colA colB
0 0
1 1
where I would need to either delete that extra row manually then shift everything up so that the first row is made up of the column headers. Is there an elegant way to start the excel read at whatever row number colA is found so we skip any unnecessary entries or text above the colA and colB headers?
Upvotes: 1
Views: 1529
Reputation: 612
I don't quite understand your problem. It looks like you know about skip_rows
.
You could just pass a list of row number to do that.
skiprows : list-like, int or callable, optional
Line numbers to skip (0-indexed) or number of lines to skip (int)
at the start of the file.
For example,
rows_to_skip=[0,1,2] #skip first 3 rows of the file
df = pd.read_excel(filename, skiprows=rows_to_skip)
There is also a way to slightly simplify the process. Say, you don't know the exact line where your column headers are. You can use grep
to obtain this number in a terminal and just get rid of all rows before that.
For example,grep -n 'colA' filename
will return the line where that information is found along with a line number.You could easily than make a list to skip all preceding lines like this rows_to_skip=list(range(line_number))
. Not the best possible solution(memory-wise due to list
), but it should also work here.
Upvotes: 2
Reputation: 4660
Assuming you know the first column name (i.e. colA
in your example), and that this value will be present somewhere in the first column of data:
if df.columns[0] != "colA": # Check first if column name is incorrect.
# Get the first column of data:
first_col = df[df.columns[0]]
# Identify the row index where the value equals the column name:
header_row_index = first_col.loc[first_col == "colA"].index[0]
# Grab the column names:
column_names = df.loc[header_row_index]
# Reset the df to start below the new header row, and rename the columns:
df = df.loc[header_row_index+1:, :]
df.columns = column_names
Upvotes: 2