Walt Reed
Walt Reed

Reputation: 1466

Pandas - First & last rows based on condition

I'm trying to parse an Excel file. I have the file in a dataframe, and I want the header row to be based on a value (in this case, header starts where cell == 'Hour')

Additionally, I want the dataframe to end on the row BEFORE cell == 'All Fixed Periods'. I can't use where cell == '12:00 PM' since each Excel file will have varying ending time values.

This is what my dataframe looks like:

df1 = pd.DataFrame(
[['Business Dates','3/15/2017'],
['Locations','ABC Location'],
['Breakfast',5073.35],
['Lunch',1489.45],
['PM',129.93],
['Hour','Net Sales'],
['6:00 AM',11.56],
['7:00 AM',142.36],
['8:00 AM',153.54],
['9:00 AM',200.34],
['10:00 AM',240.33],
['11:00 AM',110.12],
['12:00 PM',114.11],
['All Fixed Periods',]],
columns=['Service Performance Summary','Sales'])

df1

Service Performance Summary Sales
0   Business Dates  3/15/2017
1   Locations   ABC Location
2   Breakfast   5073.35
3   Lunch       1489.45
4   PM Break    129.93
5   Hour    Net Sales
6   6:00 AM     11.56
7   6:00 AM     150.34
8   7:00 AM     142.36
9   8:00 AM     153.54
10  9:00 AM     200.34
11  10:00 AM    240.33
12  11:00 AM    110.12
13  12:00 PM    114.11
14  All Fixed Periods   None

into:

[['6:00 AM',11.56],
['7:00 AM',142.36],
['8:00 AM',153.54],
['9:00 AM',200.34],
['10:00 AM',240.33],
['11:00 AM',110.12],
['12:00 PM',114.11],],
columns=['Hour','Net Sales'])

Hour    Net Sales
1   6:00 AM     11.56
2   6:00 AM     150.34
3   7:00 AM     142.36
4   8:00 AM     153.54
5   9:00 AM     200.34
6   10:00 AM    240.33
7   11:00 AM    110.12
8   12:00 PM    114.11

The closest I've come is:

df.columns = df.iloc[5]
df = df[6:]

The only issue with this is that "Hour" might begin at a different index on different reports.

Upvotes: 3

Views: 3584

Answers (1)

piRSquared
piRSquared

Reputation: 294278

I'd use np.where to locate the first positions

vals = df1.values
col1 = vals[:, 0]
header_start = np.where(col1 == 'Hour')[0][0]
eodf = np.where(col1 == 'All Fixed Periods')[0][0]

pd.DataFrame(vals[header_start + 1:eodf], columns=vals[header_start])

       Hour Net Sales
0   6:00 AM     11.56
1   7:00 AM    142.36
2   8:00 AM    153.54
3   9:00 AM    200.34
4  10:00 AM    240.33
5  11:00 AM    110.12
6  12:00 PM    114.11

Could also use argmax

vals = df1.values
col1 = vals[:, 0]
header_start = (col1 == 'Hour').argmax()
eodf = (col1 == 'All Fixed Periods').argmax()

pd.DataFrame(vals[header_start + 1:eodf], columns=vals[header_start])

Upvotes: 1

Related Questions